# Sample scraper - hec.gov.pk

This script walks through the basic steps of scraping a single web page, using a page from hec.gov.pk (Higher Education Commission, Pakistan) as the example.

It covers importing the textual contents of a single page and some basic subsetting/selection.

## Prerequisites

The script uses the libraries `requests` and `scrapy` (especially the `Selector` tool). If you are running the script on the CALDISS Jupyter server, these should already be installed.

Run cells by clicking on them and pressing `Shift+Enter`, or by clicking the "Run" icon above.

In [66]:
import scrapy
import requests
from scrapy import Selector

# How web scraping works

Web scraping works by collecting the HTML of a webpage. HTML is the actual content of the webpage, consisting of code, tags and text. A browser renders the HTML to set up the layout, styles, images and so on, but with web scraping we are (in most cases) collecting the raw HTML code.

## A webpage in the eyes of a scraper

This script uses this url as an example: https://www.hec.gov.pk/english/services/Pages/Research-Grants.aspx
In a browser, the webpage looks like this (clipped):

![hec_browser](hec_img1.png)

When viewed through a scraper, the same page looks like this (clipped):

![hec_raw](hec_img2.png)

You can always see the HTML content of a webpage by right-clicking and choosing "View source" (Firefox).

All the content appearing on a webpage through a browser is enclosed in tags.
For example, the content between the tags `<div id="s4-workspace">` and `</div>` renders as something visible on the page. 

When scraping, we use these tags to locate and extract specific parts of a webpage. The tags themselves are usually not the information we want; they tell us where that information is.
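To make the idea of content enclosed in tags more concrete, here is a minimal sketch that prints an indented outline of tags and text. It uses Python's built-in `html.parser` rather than the `scrapy` tools used in the rest of this script, and the HTML snippet is made up for illustration (only the `s4-workspace` id is borrowed from the example above). The class name `TagOutline` is also just an illustrative choice.

```python
from html.parser import HTMLParser

class TagOutline(HTMLParser):
    """Collect an indented outline of the tags and text in an HTML string."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many open tags we are currently inside
        self.lines = []  # the outline, one entry per tag or text node

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.lines.append("  " * self.depth + repr(text))

# Made-up snippet for illustration; not the real hec.gov.pk markup.
sample_html = (
    '<div id="s4-workspace"><h1>Research Grants</h1>'
    '<p>See the <a href="/english">home page</a>.</p></div>'
)

parser = TagOutline()
parser.feed(sample_html)
parser.close()
outline = parser.lines
print("\n".join(outline))
```

Each level of indentation corresponds to one level of nesting: the link text sits inside `<a>`, which sits inside `<p>`, which sits inside the `<div>`.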

# Accessing the web through python (selector objects)

We access the HTML of a website in Python through so-called "selector objects". These allow us to navigate to specific tags and extract their textual content.
The code cell below goes through three steps:
1. Setting the URL to be scraped
2. Sending a request and getting the content (the HTML) of that URL
3. Converting the HTML to a selector object

In [67]:
grants_url = 'https://www.hec.gov.pk/english/services/Pages/Research-Grants.aspx' #main URL
grants_html = requests.get(grants_url).content #extract HTML
grants_sel = Selector(text = grants_html) #create Selector object
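The request step above assumes that the page loads without problems. A slightly more defensive variant, sketched below, sets a timeout and raises an error if the server responds with a failure code. The function name `fetch_html` and its `client` parameter are made up for this illustration; `client` is only there so the function can be tried out without network access.

```python
import requests

def fetch_html(url, timeout=10, client=requests):
    """Return the HTML of `url` as text, or raise if the request fails.

    `client` is a hypothetical hook: anything with a requests-style
    `.get(url, timeout=...)` method works. It defaults to `requests`.
    """
    response = client.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                  # raises an error on 4xx/5xx responses
    return response.text                         # the decoded HTML as a string
```

Assuming the names from the cell above, the result could be passed on in the same way as before, for example `Selector(text=fetch_html(grants_url))`.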

## Scraping text from specific HTML tags

HTML is organised as a tree structure. That means a tag can contain other tags, which in turn contain other tags, and so on. With selector objects, we can either extract all text within a specific tag or navigate to a specific subtag.

For example, on the "Research Grant Programs" page on hec.gov.pk, the names of the individual grants are located within a `<div>` tag (a section) with the class `ms-rtestate-field`.

We find this out by inspecting the source code: right-click the webpage and choose "View source" (Firefox).

If we wanted to navigate and extract the names of those grants, we would specify it as follows (using CSS locators):

In [68]:
grants_titles = grants_sel.css('div.ms-rtestate-field ::text').extract()

The `.` indicates a class. `div.ms-rtestate-field` thus corresponds to `<div class="ms-rtestate-field">` in the HTML. 

The `::text` specifies we want the textual content extracted from the tag - not the HTML.

The `.extract()` call returns the matched text as a Python list rather than as another Selector object.

In [69]:
grants_titles

['Competitive Research Grants:',
 'Grant Challenge Fund (GCF)',
 'Local Challenge Fund (LCF)',
 'Technology Transfer Support Fund (TTSF)',
 'Innovative & Collaborative Research\xa0Grant (ICRG)',
 'National Research Programme for Universities (NRPU)',
 'Technology Development Fund (TDF)',
 'Problem Based Applied Interdisciplinary Research Programme (PBAIRP)',
 'Outstanding Research Awards',
 'Startup Research Grant (SRGP)',
 'Research Support Grants:',
 'Travel Grants for Presentation of Research Papers',
 'Grants for Organizing Seminars/Conferences',
 'Textbook\xa0&\xa0Monograph Writing',
 'HEC Library',
 'Mobility Grants:',
 'Pak-FRANCE Peridot Research Program',
 "PAK-TURK Researchers' Mobility Grant Programme",
 'Pakistan Program For Collaborative Research (PPCR)',
 'Pak-US Joint Research Program (with USA)',
 '\xa0',
 'Research for Innovation Grants:',
 'Establishment of Offices of Research Innovation & Commercialization (ORICS)',
 "Establishment of Business Incubation Centers (BIC

CSS locators always return all tags that meet the specified criteria. Writing, for example, `div ::text` would return all text within all `div` tags.
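The extracted list shown above contains non-breaking spaces (`\xa0`) and one entry that consists of nothing else. A minimal clean-up sketch in plain Python is shown below. The helper name `clean_texts` is made up for this illustration, and `raw_titles` is a small hand-copied sample of the output above rather than the full list.

```python
# A few entries copied by hand from the output above.
raw_titles = [
    'Competitive Research Grants:',
    'Innovative & Collaborative Research\xa0Grant (ICRG)',
    'Textbook\xa0&\xa0Monograph Writing',
    '\xa0',
]

def clean_texts(texts):
    """Replace non-breaking spaces, trim whitespace and drop empty entries."""
    cleaned = []
    for text in texts:
        text = text.replace('\xa0', ' ').strip()
        if text:  # skip entries that were only whitespace
            cleaned.append(text)
    return cleaned

clean_titles = clean_texts(raw_titles)
print(clean_titles)
```

The same helper could be applied to the full list, for example `clean_texts(grants_titles)`.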

**Hyperlinks**

Hyperlinks can also be extracted. Hyperlinks are stored as the `href` attribute in an `<a>` tag (a hyperlink).

We extract hyperlinks using a CSS locator like the one below:

In [70]:
site_urls = grants_sel.css('a::attr(href)').extract()

The code above corresponds to an HTML tag like `<a href="www.website.com">`. It is specified differently because we want to extract information from *within* the tag rather than from between two tags (e.g. `<p>` and `</p>`).

The locator above returns all links on the page.

In [71]:
site_urls

['/english',
 '#nogo',
 '#nogo',
 '#nogo',
 '/english',
 '/urdu',
 'javascript: {}',
 '/english/pages/home.aspx',
 '/english/universities/Pages/AJK/default.aspx',
 '#',
 '/english/universities/Pages/Accreditation.aspx',
 '/english/universities/pages/recognised.aspx',
 '/english/universities/Pages/AJK/Illegal-DAIs.aspx',
 '/english/universities/Pages/University-Ranking.aspx',
 '/english/universities/Pages/AJK/UniversitiesStatistics.aspx',
 '/english/services/students/Pages/GRP.aspx ',
 '/english/services/pages/default.aspx',
 '/english/services',
 '/english/services/students',
 '/english/services/faculty',
 '/english/services/universities',
 '/english/services/universities/Monitoring-Evaluation/Pages/default.aspx',
 '/english/services/universities/Pages/Policy.aspx',
 '/english/services/PhD-DB/Pages/default.aspx',
 '/english/scholarshipsgrants/pages/default.aspx',
 '/english/scholarshipsgrants',
 '/english/scholarshipsgrants/Pages/NationalScholarships.aspx',
 '/english/scholarshipsgrant

If we want to restrict the extraction to a specific section, we can first use the locator from before to create a new selector object and then extract the URLs from it.

In [72]:
grants_subsel = grants_sel.css('div.ms-rtestate-field') #creates a new selector object from the specified locator.
grants_urls = grants_subsel.css('a::attr(href)').extract() #extracts href attributes from the selector.

With the locator above, we are only extracting links that are enclosed within the `<div>` tag with the class `ms-rtestate-field`.

In [73]:
grants_urls

['/english/services/faculty/GCF/Pages/default.aspx',
 '/english/services/faculty/LCF/Pages/default.aspx',
 '/english/services/faculty/TTSF/Pages/default.aspx',
 '/english/services/faculty/ICRPG/Pages/default.aspx',
 '/english/services/universities/nrpu/Pages/Introduction.aspx',
 '/english/services/students/TDF/Pages/Intro.aspx',
 '/english/services/faculty/PBAIRP/Pages/default.aspx',
 '/english/services/faculty/HEC%20Outstanding%20Research%20Awards/Pages/Introduction.aspx',
 '/english/services/faculty/Start-Up%20Research%20Grant%20Program/Pages/Introduction.aspx',
 '/english/services/faculty/HEC%20Research%20Travel%20Grant/Pages/HEC-Research-Travel-Grant.aspx',
 '/english/services/universities/GrantsforSeminarConferenceTraining/Pages/Introduction.aspx',
 '/english/services/faculty/MTBW/Pages/default.aspx',
 '/english/services/students/HEC-Library/Pages/default.aspx',
 '/english/services/faculty/peridot/Pages/default.aspx',
 '/english/services/faculty/PTRG/Pages/default.aspx',
 '/englis

We are still getting more links than just the grants, but we are getting closer!
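The links in the outputs above are relative paths (such as `/english/...`), and the full list also contains placeholders like `#nogo` and `javascript: {}`. One possible next step, sketched below with the standard library's `urljoin`, is to drop the placeholders and turn the rest into full URLs. The helper name `to_absolute_urls` is made up for this illustration, and `sample_links` is a small hand-copied sample of the outputs above.

```python
from urllib.parse import urljoin

page_url = 'https://www.hec.gov.pk/english/services/Pages/Research-Grants.aspx'

# A few entries copied by hand from the outputs above.
sample_links = [
    '/english',
    '#nogo',
    'javascript: {}',
    '#',
    '/english/services/students/Pages/GRP.aspx ',
    '/english/services/faculty/GCF/Pages/default.aspx',
    '/english',
]

def to_absolute_urls(links, base_url):
    """Drop placeholder links and resolve the rest against `base_url`."""
    absolute = []
    for link in links:
        link = link.strip()  # some hrefs carry stray whitespace
        if not link or link.startswith('#') or link.lower().startswith('javascript:'):
            continue         # not a link to another page
        full_url = urljoin(base_url, link)
        if full_url not in absolute:  # keep the first occurrence only
            absolute.append(full_url)
    return absolute

absolute_urls = to_absolute_urls(sample_links, page_url)
print(absolute_urls)
```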

Web scraping usually involves a lot of trial and error to get the locators right, so that they pick out exactly the information you want.

Try changing or copying some of the cells above and extract different information.