# Scraping University Researchers

As a project from my Master's in Big Data Analysis I've performed my first solo web scraping. The task consisted on extracting specific information on the university's members of the different research groups. [Universitat de les Illes Balears](https://www.uib.es/es/), *aka* UIB.

## Aim:
+ Identification and extraction of the researchers' information:
    + Name
    + Gender 
    + Researcher Level
    + Role
    + Title
    + CV
    + Research Group

### Part 1

The researcher's data is available [here](https://www.uib.eu/research/groups/). The *All* cateogory displays all the research groups in a easier format so the scraping will start at:
  + [url_en](https://www.uib.eu/research/groups/grups_area/id_area=-1%2526npag=1) english version.
  + [url_cat](https://www.uib.cat/recerca/estructures/grups/grups_area/id_area=-1%2526npag=1) catalan version.
  + [url_sp](https://www.uib.es/es/recerca/estructures/grups/grups_area/id_area=-1) spanish version.


Then the first section consists of two parts: 
  1. Getting into an specific research group's web page and find the list of members.
  2. Identify when there are no more pages listing the departments to stop the scrapping.
  
### Part 2
 
Inside the research team's main page, the researchers are divided in 3 levels: 
 + **Main Resercher**. Usually just one high level researcher.
 + **Members**. Many mid-high level researchers
 + **Collaboratos**. Many reserchers with distinct professional levels.

The second part consists of identifying the following information from each member:
 + Name
 + Gender
 + Category of researcher in the team
 + Role at the univeristy (University Relationship)
 + Title

### Part 3

The last scraped data is a summary of the researcher's *curriculum vitae*. Some members don't have a personal university web page, hence there won't be any cv extraction in those cases. Moreover, not all researchers have their *cv* in all languages, so depending on the language some *cv* will be missing to. 

### Ensambling

At the end of the notebook the functions and procedures defined in the previous sections are merged together in order to complete the process of scraping all researchers data. 

The language of scrapping can be modifyed by replacing the initial url by the corresponding with the desired language. This project has been done with the catalan version of the web page in order to be able to identify the researchers gender, in spanish could be known too. 

For non-catalan or spanish speakers, note that the female version of the personal title is like the male's one, adding an *a* at the end. Eg: Dr./Dra., Sr./Sra. Hence the gender of the researcher can be easily identifyed. 

### Beautiful Soup

In [1]:
from bs4 import BeautifulSoup
import requests
import json

## Part 1

In [2]:
# Research Groups
# url_en = "https://www.uib.eu/research/groups/grups_area/id_area=-1" english
url_cat = "https://www.uib.cat/recerca/estructures/grups/grups_area/id_area=-1%2526npag=1"
last_cat = "https://www.uib.cat/recerca/estructures/grups/grups_area/id_area=-1%2526npag=15"

In [3]:
def load_page(url):
    web = requests.get(url)
    if web.status_code == 200:
        return BeautifulSoup(web.text, 'html5')
    else:
        print("Something went wrong")
        return None

In [41]:
cat = load_page(url_cat)
# cat

From the list of departments, the useful information is found inside a *div* elmenet from the class *uib_style_filanuica*

In [5]:
container = cat.find("div", class_ = "uib_style_filaunica")
# container

Once in the page and with the information selected, the departments are oganized in a **list of items**. Each items contains the name of the research group, or department, as well as it's url.

In [33]:
groups = container.find_all("li")
# groups

In [7]:
groups[0].get_text(strip=True)

'Genètica humana'

In [8]:
groups[0].find("a").get("href")

'https://www.uib.cat/recerca/estructures/grups/grup/GENEHUMA/'

The following function extracts the data per department, returns the **name** and an **url** 

In [9]:
def extract_group(dpt):
    name = dpt.text.strip() #same as get_text(strip=True)
    url = dpt.find("a").get("href")
    return name, url

In [10]:
extract_group(groups[0])

('Genètica humana',
 'https://www.uib.cat/recerca/estructures/grups/grup/GENEHUMA/')

Each group's web page contains the general information. In addition, in one of the side menus the **Research Team** url is found. This url appears to be the same as the general one, adding the following peace string: _"/equip/index.html"_. 

Identify if a following page is available

In [11]:
def next_page(page):
    return page.find("img", title = "Pàg. Següent") is not None

In [12]:
next_page(cat)

True

In [14]:
last = load_page(last_cat)
next_page(last) 

False

## Part 2

In each team's page there are three possible members categories. Some members of the collaborations category do not inform about their personal web page, hence their short cv summary won't be available.

In [15]:
group_name, group_url = extract_group(groups[1])
members_url = group_url + "/equip/index.html"
members_page = load_page(members_url)
group_name

'Microbiologia (Microbio)'

Again the information is organized inside a *div* element, with the *itemprop* attribute holding the value *mainContentOfPage*

In [16]:
content = members_page.find("div", itemprop = "mainContentOfPage")
# print(content.prettify())

In [17]:
# Member's level
levels = content.find_all("h3")
levels

[<h3>Investigador principal</h3>, <h3>Membres</h3>, <h3>Col·laboradors</h3>]

The list of each category's members is found after the *h3* element, thus I've searched for the siblings of the respective *h3* to obtain the members.

In [18]:
main_res = levels[1].find_next_sibling().find_all("li")
main_res
pers = main_res[9]
pers
pers.find("a") == None 

True

In [19]:
colabs = levels[2].find_next_sibling().find_all("li")
colabs

[<li><a class="link_fitxa" href="https://www.uib.cat/personal/ABjI0NDcyOQ/">Sra. Maria Cañellas Cifre</a> (Tècnica mitjana)</li>,
 <li><a class="link_fitxa" href="https://www.uib.cat/personal/ABjI4NTAxMg/">Sr. Víctor Fernández Juárez</a> (Tècnic especialista)</li>]

In [20]:
person = colabs[0].text
person

'Sra. Maria Cañellas Cifre (Tècnica mitjana)'

Given each member with the first distinctive I've detected either it's a male or a female, the second part is the full name and the last part in between parenthesis is the member's role at the university.

In [21]:
p1, p2 = person.split("(")
p1 = p1.strip().split() # Divide the name and category
p1

['Sra.', 'Maria', 'Cañellas', 'Cifre']

In [22]:
name = " ".join(p1[1:])
name

'Maria Cañellas Cifre'

In [23]:
def extract_gender(x):
    if x[2]== "a":
        return "F" #Female
    else:
        return "M" #Male
extract_gender(p1[0])

'F'

In [24]:
person0 = colabs[0]
pers_url = person0.find("a").get("href") #personal url
pers_url

'https://www.uib.cat/personal/ABjI0NDcyOQ/'

### Part 3
The last part of the project consists of getting in to the person's personal web page and extract the cv summary in case there is one.

In [25]:
pers_web = load_page(pers_url)
cv = pers_web.find("div", id = "cv_breve")
print(cv)

None


This short CV information can be available in catalan, spanish and english, yet some researchers don't have the summary in all languages. Hence, this posible situation needs to be indenfied to solve any kind of potential error.  

In [26]:
def extract_cv(url):
    personal_page = load_page(url)
    if personal_page.find("div", class_ = "uib_style_nolanguageversion") is not None: #when is not found in a specific language
#         print("No CV in this language")
        return None
    if personal_page.find("div", id = "cv_breve") is None:
#         print("No CV for this researcher")
        return None
    else:
        return personal_page.find("div", id = "cv_breve").text

In [27]:
extract_cv(pers_url)
extract_cv('https://www.uib.cat/personal/AAzM1OA/')

"Soy licenciado con Grado en Biología por la Universidad de las Islas Baleares (1989) y Doctor en Biología por la misma Universidad (1994). He sido profesor de Microbiología de la UIB desde el curso académico 1994-95 con diferentes categorías contractuales, habiendo realizado estancias puntuales de investigación en otros centros y participado en una campaña oceanográfica Antártica. Desde el curso académico 2008-09 soy Profesor Titular de Microbiología y estoy adscrito al Departamento de Biología de la UIB. He estado vinculado a trabajos de investigación en temáticas diferentes desde 1988: filogenia y citogenética de coleópteros (alumno colaborador, 1988-89), fijación de nitrógeno en Azotobacter vinelandii (becario predoctoral FPI, 1990-1994), degradación de hidrocarburos aromáticos en Pseudomonas (1995-actualidad) y miembros del grupo Roseobacter (2006-actualidad), así como ecofisiología del grupo Roseobacter en aguas costeras antropogenizadas (2001-actualidad). Desde el año 2010 he em

In [28]:
def extract_researcher(item, level):
    url = item.find("a")
    text = item.text
    parts = text.split("(")
    p1 = parts[0].strip().split()
    cv = None
    if url is not None:
        personal_url = url.get("href")
        cv = extract_cv(personal_url)
    return {
        'name' : " ".join(p1[1:]),
        'gender' : extract_gender(p1[0]),
        'reasearcher level' : level.get_text(),
        'role' : parts[1][:-1],
        'title' : p1[0],
        'cv' : cv
    } 

In [29]:
dict = extract_researcher(person0, levels[0])
dict

{'name': 'Maria Cañellas Cifre',
 'gender': 'F',
 'reasearcher level': 'Investigador principal',
 'role': 'Tècnica mitjana',
 'title': 'Sra.',
 'cv': None}

## Ensambling

In [37]:
def scrapping_researchers(url):
    i = 1
    researchers = []
    main = url[:-1] + '{}'                # extract the main url for surfing to further pages.
    groups_list_page = load_page(url)     # first url.
    stop = False                          # stoper
    while stop is False:    # last main page scraped is the one with no pages following it.
        print("Scrapping page " + str(i))
        content1 = groups_list_page.find("div", class_ = "uib_style_filaunica")
        groups = content1.find_all("li")
        for group in groups:
            g_name, g_url = extract_group(group)
            content2 = members_page.find("div", itemprop = "mainContentOfPage")
            levels = content2.find_all("h3")
            for lvl in levels:
                members = lvl.find_next_sibling().find_all("li")
                for member in members:
                    researcher = extract_researcher(member, lvl)
                    researcher["research group"] = g_name
                    researchers.append(researcher)
        if next_page(groups_list_page) == False:
            stop = True
        else: 
            i += 1
            url = main.format(i)
            groups_list_page = load_page(url)
    return researchers

In [42]:
researchers_uib = scrapping_researchers(url_cat)

Scrapping page 1
Scrapping page 2
Scrapping page 3
Scrapping page 4
Scrapping page 5
Scrapping page 6
Scrapping page 7
Scrapping page 8
Scrapping page 9
Scrapping page 10
Scrapping page 11
Scrapping page 12
Scrapping page 13
Scrapping page 14
Scrapping page 15


In [43]:
researchers_uib[0]

{'name': 'Rafael Bosch Zaragoza',
 'gender': 'M',
 'reasearcher level': 'Investigador principal',
 'role': "Professor titular d'universitat",
 'title': 'Dr.',
 'cv': "Soy licenciado con Grado en Biología por la Universidad de las Islas Baleares (1989) y Doctor en Biología por la misma Universidad (1994). He sido profesor de Microbiología de la UIB desde el curso académico 1994-95 con diferentes categorías contractuales, habiendo realizado estancias puntuales de investigación en otros centros y participado en una campaña oceanográfica Antártica. Desde el curso académico 2008-09 soy Profesor Titular de Microbiología y estoy adscrito al Departamento de Biología de la UIB. He estado vinculado a trabajos de investigación en temáticas diferentes desde 1988: filogenia y citogenética de coleópteros (alumno colaborador, 1988-89), fijación de nitrógeno en Azotobacter vinelandii (becario predoctoral FPI, 1990-1994), degradación de hidrocarburos aromáticos en Pseudomonas (1995-actualidad) y miembr

The previous example shows the selection of scraped values from each UIB's researcher.

Lastly, I'm going to use the following functions to load the scrapped data into a *.json* file and to load it back. As the catalan and spanish languages requier, the data is *utf8* encoded.

In [44]:
def dump_data(data, filename):
    with open(filename, "w", encoding = 'utf8') as out_file:
        json.dump(data, out_file, ensure_ascii = False)
        
def load_data(filename):
    data = None
    with open(filename, "r", encoding = 'utf8') as in_file:
        data = json.load(in_file)
    return data

In [45]:
dump_data(researchers_uib, 'Researchers_UIB_Cat.json')

## Results

In this last section of the project I show two posible formats in which the scraped data can be used. The first one, is directly the format of the *.json* file, which is a list of dictionaries. The second format is a pandas dataframe which is one of the best option for a potential data analysis. 

Lastly, I would like to add that regarding my university's web page, the different language versions work perfectly fine, however some fields do not correspond to the specific language. The researcher of the following example, Rafael Bosch Zaragoza, has his *cv* in spanish for both the catalan and the spanish version of the web page. I guess that's not a huge problem as the catalan speakers are mainly bilingual so they can understan both languages. Most probably the researchers are the ones adding their *curriculum*'s summaries, yet I belive the university should have enough resources to solve this nuances.

In [46]:
data = load_data("Researchers_UIB_Cat.json")
data[0]

{'name': 'Rafael Bosch Zaragoza',
 'gender': 'M',
 'reasearcher level': 'Investigador principal',
 'role': "Professor titular d'universitat",
 'title': 'Dr.',
 'cv': "Soy licenciado con Grado en Biología por la Universidad de las Islas Baleares (1989) y Doctor en Biología por la misma Universidad (1994). He sido profesor de Microbiología de la UIB desde el curso académico 1994-95 con diferentes categorías contractuales, habiendo realizado estancias puntuales de investigación en otros centros y participado en una campaña oceanográfica Antártica. Desde el curso académico 2008-09 soy Profesor Titular de Microbiología y estoy adscrito al Departamento de Biología de la UIB. He estado vinculado a trabajos de investigación en temáticas diferentes desde 1988: filogenia y citogenética de coleópteros (alumno colaborador, 1988-89), fijación de nitrógeno en Azotobacter vinelandii (becario predoctoral FPI, 1990-1994), degradación de hidrocarburos aromáticos en Pseudomonas (1995-actualidad) y miembr

This data can also be easily loaded into a pandas dataframe with [read_json()](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html).

In [47]:
import pandas as pd

df = pd.read_json("Researchers_UIB_Cat.json", encoding='utf8')
df.head(10)

Unnamed: 0,name,gender,reasearcher level,role,title,cv,research group
0,Rafael Bosch Zaragoza,M,Investigador principal,Professor titular d'universitat,Dr.,Soy licenciado con Grado en Biología por la Un...,Genètica humana
1,Jorge Lalucat Jo,M,Membres,Professor emèrit,Dr.,,Genètica humana
2,Elena Isabel García-Valdés Pukkits,F,Membres,Catedràtica d'universitat,Dra.,Fecha del CVA 15.09.2014 Elena García-Valdés P...,Genètica humana
3,Antonio Bennàsar Figueras,M,Membres,Professor titular d'universitat,Dr.,"Antoni Bennasar Figueras (d'ara endavant, ABF)...",Genètica humana
4,Margarita Gomila Ribas,F,Membres,Professora titular d'universitat,Dra.,Soc Llicenciada en Biologia (1999) i en Bioqui...,Genètica humana
5,Balbina Nogales Fernández,F,Membres,Professora titular d'universitat,Dra.,Sóc llicenciada en Ciències Biològiques por la...,Genètica humana
6,Joseph Alexander Christie de Oleza,M,Membres,Contractat Ramón y Cajal,Dr.,Dr Joseph Christie-Oleza (JC-O) has recently m...,Genètica humana
7,Antonio Busquets Bisbal,M,Membres,Tècnic superior,Dr.,Mi formación como investigador empezó con dos ...,Genètica humana
8,Maria Magdalena Mulet Pol,F,Membres,Tècnica superior,Dra.,,Genètica humana
9,Ramon Rossello Mora,M,Membres,Professor associat,Dr.,,Genètica humana


In [48]:
from tabulate import tabulate
print(tabulate(df.tail(1), tablefmt = 'pipe', headers = 'keys'))

|      | name                    | gender   | reasearcher level   | role                | title   | cv   | research group                                           |
|-----:|:------------------------|:---------|:--------------------|:--------------------|:--------|:-----|:---------------------------------------------------------|
| 2645 | Víctor Fernández Juárez | M        | Col·laboradors      | Tècnic especialista | Sr.     |      | Unitat de Gràfics i Visió per Ordinador i IA (UGiVpOeIA) |
