<a href="https://colab.research.google.com/github/MCanela-1954/DataSci_Course/blob/main/%5BDATA-03E%5D%20Example%20-%20IESE%20faculty%20data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [DATA-03E] Example - IESE faculty data

## Introduction

In this example, we scrape data on the professors from the website of IESE Business School. The tools used are taken from the Python packages **Requests** and **Beautiful Soup**. Both packages are available in Colab notebooks. Requests is included in the Anaconda distribution, but Beautiful Soup is not. You can install it on your computer by running `pip install bs4` in the shell or in a Jupyter app.

The basic information on the IESE faculty is posted on seven webpages. Except for the last one, each of these pages contains information on 20 professors. The URL for the first page is `https://www.iese.edu/search/professors/1/` (if you omit the number at the end, you also get the first page). The rest of the pages are obtained by increasing the counter. We will first see how to work on the first page, and then we will loop over the rest of the pages.
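
Since the page URLs follow a simple pattern, they can be generated rather than typed. The following is a minimal sketch, assuming that the count of seven pages given above still holds. The names `base_url` and `page_urls` are ours, introduced only for illustration.

```python
# Build the seven page URLs from the counter at the end of the address
base_url = 'https://www.iese.edu/search/professors/{}/'
page_urls = [base_url.format(i) for i in range(1, 8)]

print(page_urls[0])
print(len(page_urls))
```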


## The target data

We aim at capturing the following fields:

* `name`, the complete name of the professor. Example: Javier Aguirreamalloa Arizaga.

* `job`, the job title of the professor. Example: Associate Professor of the Practice of Management of Financial Management.

* `link`, the URL for the professor's personal page. Example: `https://www.iese.edu/faculty-research/faculty/javier-aguirreamalloa-arizaga/`.

* `picture`, the URL for the professor's picture. Example: `https://www.iese.edu/wp-content/uploads/2018/11/Aguirreamalloa_Javier-1.jpg`.

## Capturing the source code

We import Requests as:

In [None]:
import requests

To get the source code of a web page, we apply the Requests function `get()` to the URL of that page. When the request is accepted, as in this case, this function returns an object of a special type (type `requests.models.Response`). The attribute `.text` of this object is a string which, for an ordinary webpage, is the source HTML code of that page.

In [None]:
html_str = requests.get('https://www.iese.edu/search/professors/1/').text

Now, `html_str` is a string containing the source code of the IESE faculty first page.
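
The call above takes for granted that the request succeeds. A slightly more defensive sketch, which is not part of the original example, sets a timeout and raises an error when the server answers with a bad HTTP status. The helper name `fetch_html` and the 10-second timeout are our own choices.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML source of a page, or raise if the request fails."""
    response = requests.get(url, timeout=timeout)
    # Turn a 4xx or 5xx response into a requests.HTTPError exception
    response.raise_for_status()
    return response.text

# Usage (needs network access):
# html_str = fetch_html('https://www.iese.edu/search/professors/1/')
```

With this helper, a failed request stops the notebook with an explicit error instead of passing an error page on to the parser.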

## Parsing the source code

To **parse** the source code, learning the tree structure it conveys, we use the function `BeautifulSoup()`, from the package `bs4` (Beautiful Soup, version 4). We import this function with:

In [None]:
from bs4 import BeautifulSoup

`BeautifulSoup()` converts the string `html_str` to a "soup" object:

In [None]:
soup = BeautifulSoup(html_str, 'html.parser')

## The first professor

In web scraping jobs, we take advantage of the fact that web pages posting information units in a systematic way have a repetitive structure, made of a set of HTML elements with the same names and attribute values. This is, precisely, what allows IESE to update the pages in a programmatic way, in order to cope with the changes in the faculty composition.

To use the methods `.find()` and `.find_all()`, we need to know the name of the HTML element and, sometimes, some of the attributes. How can we find this? There are many ways, and every practitioner has their own cookbook. We use here a simple approach, based on the browser tools, more specifically on the *Inspect* tool.

In the browser, we right-click on the area where the information sought is stored. A contextual menu pops up, in which we select *Inspect*. This opens the panel *Developer Tools*. The *Elements* window in this panel shows a view of the source code corresponding to the area on which we have clicked.

Let us do this where the information on the first professor is displayed. This is a rectangular area, with the picture on top, and the name and job below. In the *Elements* window, this is the first of a series of elements with the same start tag:

```
<div class="col-12 col-md-4 col-lg-3 employee-card-box">
```

The method `.find()` gives the first of these elements, which corresponds to the first professor.

In [None]:
block1 = soup.find('div', 'col-12 col-md-4 col-lg-3 employee-card-box')
block1
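
The second argument of `.find()` is matched against the `class` attribute. Beautiful Soup accepts either the exact class string, as above, or a single one of the classes. The sketch below illustrates this on made-up markup that only imitates the card containers (it is not the real page), so the names `toy_html` and `toy_soup` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the card containers; not the real page
toy_html = '''
<div class="col-12 col-md-4 col-lg-3 employee-card-box"><p>First card</p></div>
<div class="col-12 col-md-4 col-lg-3 employee-card-box"><p>Second card</p></div>
'''
toy_soup = BeautifulSoup(toy_html, 'html.parser')

# Exact class string, as in the example above
by_full_string = toy_soup.find('div', 'col-12 col-md-4 col-lg-3 employee-card-box')
# A single class is enough when it is distinctive
by_one_class = toy_soup.find('div', class_='employee-card-box')
# Same classes written in a different order: the string filter finds nothing
reordered = toy_soup.find('div', 'employee-card-box col-12 col-md-4 col-lg-3')

print(by_full_string.p.string)
print(by_one_class.p.string)
print(reordered)
```

Filtering on a single distinctive class is less sensitive to layout changes, such as a new grid class added to the containers.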

The four pieces of information we wish to capture come in the following descendants:

* An `a` element of class `employee-card-link` contains the link to the professor's page.

* The link to the professor's picture appears twice: in an `img` element of class `image-fluid lazyload`, and in a `noscript` element. We will use the `img` element, leaving aside the `noscript` element, which is included just in case JavaScript is not available in your browser.

* A `p` element of class `employee-card__description__name` contains the professor's name. A `p` tag is about the same as a `div`, the only difference being that a `p` element is meant to contain paragraphs of text and a `div` element can contain anything.

* A `p` element of class `employee-card__description__job` contains the professor's job.

So, we can use `.find()` to capture these four values. Let us start with the name and the job, and then take the link and the picture.

In [None]:
name = block1.find('p', 'employee-card__description__name')
name

We can extract the name with `.string`:

In [None]:
name = name.string
name
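
A word of caution about `.string`: it returns the text only when the element has a single child, and `None` otherwise. The method `.get_text()` collects all the text inside the element, so it is a more forgiving alternative. The snippets below are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up snippets, not taken from the real page
simple = BeautifulSoup('<p class="name">Ann Example</p>', 'html.parser').p
nested = BeautifulSoup('<p class="name">Ann <b>Example</b></p>', 'html.parser').p

print(simple.string)                     # the single text child
print(nested.string)                     # None: the text comes with a <b> child
print(nested.get_text(' ', strip=True))  # text pieces, stripped and joined by a space
```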

We repeat this for the job.

In [None]:
job = block1.find('p', 'employee-card__description__job').string
job

Now, the link for the personal page comes as an `href` attribute value. So, we use a different procedure.

In [None]:
link = block1.find('a', 'employee-card-link')['href']
link

We use the same procedure for the image as for the link, except that here the URL comes in the `data-src` attribute. Note that the `img` element does not have an end tag. This is an exception, which is explained by the fact that these elements never contain text.

In [None]:
picture = block1.find('img', 'image-fluid lazyload')['data-src']
picture
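
Attributes of an element behave like the entries of a dictionary. Square brackets raise a `KeyError` when the attribute is missing, while `.get()` returns `None` instead. The sketch below uses a made-up `img` element with a placeholder URL.

```python
from bs4 import BeautifulSoup

# Made-up element; the URL is a placeholder
toy_img = BeautifulSoup(
    '<img class="image-fluid lazyload" data-src="https://example.com/a.jpg">',
    'html.parser').img

print(toy_img.attrs)        # all the attributes, as a dictionary
print(toy_img['data-src'])  # would raise KeyError if the attribute were missing
print(toy_img.get('src'))   # None, since this element has no src attribute
```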

We can pack this information in various ways. Let us follow a JSON style, using a Python dictionary. We write a function for this task.

In [None]:
def get_block_info(block):
    name = block.find('p', 'employee-card__description__name').string
    job = block.find('p', 'employee-card__description__job').string
    link = block.find('a', 'employee-card-link')['href']
    picture = block.find('img', 'image-fluid lazyload')['data-src']
    # Return the dictionary directly, to avoid shadowing the built-in dict
    return {'name': name, 'job': job, 'link': link, 'picture': picture}

Let us see how this works on the first block.

In [None]:
get_block_info(block1)

This works as expected. Next, we loop over the 20 blocks.
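
The function `get_block_info()` assumes that every block contains the four elements. If one of them were missing, `.find()` would return `None` and the function would fail. A more tolerant variant, sketched here under our own naming (`get_block_info_safe`, `text_of` and `attr_of` are hypothetical helpers), returns `None` for any field that is not found. The test card is made up and has no picture.

```python
from bs4 import BeautifulSoup

def get_block_info_safe(block):
    """Like get_block_info(), but gives None for any field not found."""
    def text_of(name, css_class):
        tag = block.find(name, css_class)
        return tag.get_text(strip=True) if tag is not None else None

    def attr_of(name, css_class, attr):
        tag = block.find(name, css_class)
        return tag.get(attr) if tag is not None else None

    return {
        'name': text_of('p', 'employee-card__description__name'),
        'job': text_of('p', 'employee-card__description__job'),
        'link': attr_of('a', 'employee-card-link', 'href'),
        'picture': attr_of('img', 'image-fluid lazyload', 'data-src'),
    }

# Made-up card without a picture, to show the fallback
toy_block = BeautifulSoup('''
<div class="employee-card-box">
  <a class="employee-card-link" href="https://example.com/ann/"></a>
  <p class="employee-card__description__name">Ann Example</p>
  <p class="employee-card__description__job">Professor</p>
</div>''', 'html.parser').div

print(get_block_info_safe(toy_block))
```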

## The first page

We first create a list of 20 blocks, to loop over. This is easily done, by replacing `.find()` by `.find_all()`.

In [None]:
blocks = soup.find_all('div', 'col-12 col-md-4 col-lg-3 employee-card-box')

This should be a list of 20 HTML elements.

In [None]:
len(blocks)

To loop over these 20 blocks, we use a **list comprehension**.

In [None]:
data = [get_block_info(block) for block in blocks]

Let us check.

In [None]:
data[0]

In [None]:
data[-1]

We are done here: Javier is at the top and Veronica at the bottom. We now go for the other six pages.

## The complete faculty

A simple loop over the seven pages will do the job. We start with an empty list and append the data from every page to the current list.

In [None]:
data = []
for i in range(1, 8):
    html_str = requests.get(f'https://www.iese.edu/search/professors/{i}/').text
    soup = BeautifulSoup(html_str, 'html.parser')
    blocks = soup.find_all('div', 'col-12 col-md-4 col-lg-3 employee-card-box')
    newdata = [get_block_info(block) for block in blocks]
    data.extend(newdata)

In [None]:
len(data)

The last item should contain the data on the last professor, Christoph. Indeed:

In [None]:
data[-1]
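
The loop above hard-codes the number of pages. A possible alternative, sketched here, keeps asking for pages until one comes back empty, and pauses between requests. It rests on an assumption that we have not checked against the site: that a page beyond the last one contains no professor blocks. The function `collect_pages` and its parameters are hypothetical. It takes as argument any function that returns the list of items for page `i`, so it can be tried on toy data without touching the network.

```python
import time

def collect_pages(get_page_items, first=1, max_pages=50, delay=1.0):
    """Call get_page_items(i) for i = first, first + 1, ... until a page is empty."""
    items = []
    for i in range(first, first + max_pages):
        page_items = get_page_items(i)
        if not page_items:
            break  # assumed signal that we are past the last page
        items.extend(page_items)
        time.sleep(delay)  # pause between requests, to go easy on the server
    return items

# Toy stand-in for a real page scraper: three pages, then nothing
fake_pages = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e']}
result = collect_pages(lambda i: fake_pages.get(i, []), delay=0)
print(result)
```

The cap `max_pages` is a safeguard against an endless loop in case the assumption about empty pages does not hold.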

We can manage these data in many ways. For instance, you may wish to have them as a Pandas data frame.

In [None]:
import pandas as pd
df = pd.DataFrame(data)
df.info()

In [None]:
print(df.head())

In [None]:
print(df.tail())

You can export this to a CSV file, and save it in MyDrive, as follows.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df.to_csv('/content/drive/MyDrive/faculty.csv', index=False)

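Alternatively, you can export the data to a JSON file, which is just as easy from Pandas. With `orient='records'`, the file contains a list with one object per professor, and the index is left out.

In [None]:
df.to_json('/content/drive/MyDrive/faculty.json', orient='records')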
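
Since `data` is already a list of dictionaries, a JSON file can also be written with the standard library alone. The sketch below uses toy records with made-up values and a temporary file, so that it runs anywhere. In the notebook, you would pass `data` and a path in MyDrive instead.

```python
import json
import os
import tempfile

# Toy records with the same keys as the scraped data (values are made up)
toy_data = [
    {'name': 'Ann Example', 'job': 'Professor',
     'link': 'https://example.com/ann/', 'picture': None},
    {'name': 'José Ejemplo', 'job': 'Lecturer',
     'link': 'https://example.com/jose/', 'picture': None},
]

path = os.path.join(tempfile.gettempdir(), 'faculty_toy.json')
with open(path, 'w', encoding='utf-8') as f:
    # ensure_ascii=False keeps accented characters readable in the file
    json.dump(toy_data, f, ensure_ascii=False, indent=2)

# Read the file back to check that nothing was lost
with open(path, encoding='utf-8') as f:
    reloaded = json.load(f)
print(reloaded == toy_data)
```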

## Homework

1. Export the faculty data to a table of the SQLite `iese.db` database created in the example DATA-02E.

2. Query this table to extract a list of the associate professors. You can use Gemini to refresh your SQL.