# 7.3. Web Scraping

Module M-227-04: Programming for Data Analytics

Instructor: prof. Dmitry Pavlyuk

## Web Scraping

Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed.

* If there is data on a website, then in theory it can be scraped!
* Search engines regularly use web scraping to analyze, rank, and index web content.
* Scraping publicly available data is generally legal, __BUT__ further use of the scraped information can be illegal (copyright, intellectual property, sensitive information).
* Check whether the website uses the robots.txt protocol to communicate that scraping is prohibited.
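
A robots.txt check can be automated with the standard library. The sketch below is only an illustration: the robots.txt content, the `my-scraper` user agent, and the `example.com` URLs are made up so that the code runs offline, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, so the example runs offline)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data.html"))
```

For a live website, the parser can instead load the file itself with `parser.set_url(...)` followed by `parser.read()`. Note that robots.txt is only one signal; the website's terms of use still apply.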


## Web Scraping: main stages

- Stage 1. Find a website with the necessary data
- Stage 2. Check web scraping limitations: the website's terms of use, robots.txt, etc.
- Stage 3. Inspect the internal page structure and find the data to extract
- Stage 4. Write the code to retrieve the website's HTML
- Stage 5. Write the code to parse the HTML data
- Stage 6. Store the parsed data locally

## HTML

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript (JS).

In [1]:
html = """
<!doctype html>
<html lang=en>
    <head>
        <meta charset=utf-8>
        <title>Page title</title>
    </head>
    <body>
        <p>Body Content</p>
    </body>
</html>
"""

## HTML

<img src="https://upload.wikimedia.org/wikipedia/commons/5/5a/DOM-model.svg"/>
DOM illustration by Birger Eriksson (CC-BY-SA).

## HTML vs. XML

HTML and XML are related: HTML displays data and describes the structure of a webpage, whereas XML stores and transfers arbitrary data. HTML is a language with a predefined set of tags, while XML is a standard for defining other markup languages.

* XML focuses mainly on data transfer, while HTML focuses on the presentation of data.
* XML is content-driven, whereas HTML is format-driven.
* XML is case-sensitive, while HTML is case-insensitive.
* XML supports namespaces, while HTML does not.
* XML is strict about closing tags, while HTML is not.
* XML tags are extensible, whereas HTML has a limited set of predefined tags.
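
The difference in strictness and case handling can be observed with Python's standard library alone. This is a small sketch, not part of the original slides: the `fragment` string, the `TagCollector` class, and the `parses_as_xml` helper are illustrative names introduced here.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Illustrative fragment: mixed-case tag name and two unclosed <p> tags
fragment = "<Body><p>First paragraph<p>Second paragraph</Body>"

class TagCollector(HTMLParser):
    """Collects the names of all start tags seen by the HTML parser."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def parses_as_xml(text: str) -> bool:
    """Return True if text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

collector = TagCollector()
collector.feed(fragment)           # the HTML parser accepts the unclosed tags
print(collector.tags)              # tag names are reported in lowercase
print(parses_as_xml(fragment))     # the XML parser rejects the same text
print(parses_as_xml("<Body><p>First paragraph</p></Body>"))
```

The HTML parser tolerates the missing closing tags and normalizes tag names to lowercase, while the XML parser accepts the text only once every tag is properly closed.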

## HTML parsing

## Pandas

__Pandas__ has built-in routines for parsing HTML tables (see my presentations on Pandas).

In [2]:
import pandas as pd
# read_html returns a list with one DataFrame per <table> found on the page
ss_tables = pd.read_html('https://www.ss.lv/lv/transport/cars/volkswagen/passat-b7/sell/', header=0)
# The listings table is the longest table on the page
main_table = max(ss_tables, key=len)
main_table.dropna(axis='columns').head()

| | Sludinājumi \tdatums.2 | Gads | Tilp. | Nobrauk. | Cena |
|---|---|---|---|---|---|
| 0 | Ļoti smuks, Bez rūsas - nav tikko pārkrāsots. ... | 2012 | 1.6D | - | 6,990 € |
| 1 | VW Passat Highline 125 kW (170 hp). Farkops, d... | 2011 | 2.0D | 292 tūkst. | 7,600 € |
| 2 | Lielisks ģimenes auto. Pārdodam sakarā ar jaun... | 2010 | 2.0D | 330 tūkst. | 7,400 € |
| 3 | Auto labā tehniskā un vizuālā stāvoklī. Visas ... | 2013 | 2.0D | 335 tūkst. | 7,500 € |
| 4 | Ļoti labā vizuālā un tehniskā stāvoklī. Pats b... | 2011 | 2.0D | 265 tūkst. | 10,500 € |


__But what if data on the website is not table-structured?__

## Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML.

In [3]:
!pip install beautifulsoup4



In [4]:
from bs4 import BeautifulSoup

## Simple Parsing

In [5]:
print(html)


<!doctype html>
<html lang=en>
    <head>
        <meta charset=utf-8>
        <title>Page title</title>
    </head>
    <body>
        <p>Body Content</p>
    </body>
</html>



In [6]:
# Name the parser explicitly; 'html.parser' ships with Python
parsed = BeautifulSoup(html, 'html.parser')
print("Parsed Title:",parsed.select("title")[0].text)
print("Parsed Body:",parsed.select("html > body > p")[0].text)

Parsed Title: Page title
Parsed Body: Body Content


## Beautiful Soup: basics
- parsed.prettify() - prettier HTML
- parsed.title - first title tag
- parsed.p - first p tag
- parsed.p.em - first em tag within the first p tag
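
The attribute-style navigation listed above can be tried on a small document. The `sample` HTML string below is a made-up page used only for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample page (an assumption for this sketch)
sample = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <em>world</em></p><p>Second</p></body></html>"
)
parsed = BeautifulSoup(sample, "html.parser")

print(parsed.title.text)   # text of the first <title> tag
print(parsed.p.text)       # text of the first <p> tag, nested tags included
print(parsed.p.em.text)    # first <em> inside the first <p>
print(parsed.prettify())   # the same document, indented one tag per line
```

Each attribute access returns only the first matching tag; the second `<p>` is reachable through the search methods instead.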

## Beautiful Soup: searching
- parsed.find_all(True) - all descendant tags
- parsed.find_all("a") - descendants by tag name (all _a_)
- parsed.find_all("a",limit=10) - first 10 _a_ tags
- parsed.find_all("a",recursive=False) - only _a_ tags that are direct children of the tag
- parsed.find_all(href="https://tsi.lv/") - by attribute
- parsed.find_all(class_="post") - by class
- parsed.select("html > body > p") - CSS-like selector
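
A minimal sketch of these search calls on a made-up document follows. The `sample` HTML, the `posts` id, and the `example.com` links are assumptions for illustration; only the `https://tsi.lv/` address comes from the list above.

```python
from bs4 import BeautifulSoup

# Made-up sample document (an assumption for this sketch)
sample = """
<div id="posts">
  <a href="https://tsi.lv/">TSI</a>
  <p class="post">First <a href="https://example.com/a">A</a></p>
  <p class="post featured">Second <a href="https://example.com/b">B</a></p>
  <p>Third</p>
</div>
"""
parsed = BeautifulSoup(sample, "html.parser")
container = parsed.find("div", id="posts")

all_links = parsed.find_all("a")                         # every <a> in the document
first_two = parsed.find_all("a", limit=2)                # stop after two matches
direct_links = container.find_all("a", recursive=False)  # direct children of the div only
by_href = parsed.find_all(href="https://tsi.lv/")        # match on an attribute value
posts = parsed.find_all(class_="post")                   # matches any tag that has the class "post"
selected = parsed.select("div > p.post > a")             # CSS selector

for name, result in [("all_links", all_links), ("first_two", first_two),
                     ("direct_links", direct_links), ("by_href", by_href),
                     ("posts", posts), ("selected", selected)]:
    print(name, len(result))
```

Note that `recursive=False` is called on the `div` here: at the top level of the parsed document the direct children are the `div` and whitespace, so the same call on `parsed` would find no links.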

## Getting website HTML

A website's HTML is usually retrieved over the HTTP protocol (see previous presentation).

Check https://tsi.lv/robots.txt first!

In [7]:
import requests
url = "https://tsi.lv"
print("Requesting", url,'...')
response = requests.get(url)
print("Received; Response code:", response.status_code)

Requesting https://tsi.lv ...
Received; Response code: 200


## Getting website HTML

In [8]:
print("Received HTML:\n")
print(response.text[0:300],'\n...')

Received HTML:

<!doctype html>
<html lang="en-US">
<head>
	<meta charset="UTF-8">
		<meta name="viewport" content="width=device-width, initial-scale=1">
	<link rel="profile" href="https://gmpg.org/xfn/11">
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' / 
...


## Examples

## Example 1: TSI website external links

In [9]:
import requests
url = "https://tsi.lv"
print("Requesting", url, '...')
response = requests.get(url)
print("Received; Response code:", response.status_code)
soup = BeautifulSoup(response.text, 'html.parser')
# href=True keeps only <a> tags that actually have an href attribute
for link in soup.find_all('a', href=True):
    href = link['href']
    # External link: absolute URL that does not point to tsi.lv
    if href.startswith('http') and 'tsi.lv' not in href:
        print(href, link.text.strip())

Requesting https://tsi.lv ...
Received; Response code: 200
https://dinotrans.com/ 
http://www.topsailgroup.eu/ 
http://l-ekspresis.eu/en/ 
https://www.autoosta.lv/?lang=en 
https://www.evolution.com/ 
https://www.facebook.com/TSIpage Facebook
https://www.instagram.com/tsi_university/ Instagram
http://www.linkedin.com/company/transporta-un-sakaru-instit-ts Linkedin
http://vk.com/tsipage Vk
https://www.youtube.com/TSIRiga Youtube
https://apps.apple.com/us/app/tsi-schedule/id606137492 
https://play.google.com/store/apps/details?id=tsi.phonegap.schedule&hl=en 


## Example 2: TSI academic staff



In [10]:
import re

profs = []
base_url = "https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum="
pages = 10
for i in range(1, pages + 1):
    url = base_url + str(i)
    print("Parsing", url, "...")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    blocks = soup.find('div', {'class': "jet-listing-grid__items"}).find_all('div', {"class": 'jet-listing-grid__item'})
    for block in blocks:
        try:
            # Each field sits in a widget identified by a page-specific data-id
            name = block.find('div', {"data-id": "50d7e9b"}).\
                find('div', {"class": 'jet-listing-dynamic-field__content'}).text
            position = block.find('div', {"data-id": "26fd42e"}).\
                find('div', {"class": 'jet-listing-dynamic-field__content'}).text.strip()
            email = block.find('div', {"data-id": "50539e7"}).\
                find('div', {"class": 'jet-listing jet-listing-dynamic-link'}).find_all('a')[1].text.strip()
            # Anonymize: keep the first word of the name plus one letter, and mask the email's local part
            profs.append({'name': re.search(r"(\S+)\s(\S{1})", name).group(),
                          'position': position,
                          'email': "***".join(re.findall(r"(\S{2})\S+(@{1}\S+)", email)[0])})
        except (AttributeError, IndexError):
            # Skip cards that lack a field instead of reusing values from the previous card
            continue

Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=1 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=2 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=3 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=4 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=5 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=6 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=7 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=8 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=9 ...
Parsing https://tsi.lv/about-us/academic-staff/?jsf=jet-engine:staff&pagenum=10 ...


## Example 2: TSI academic staff



In [11]:
profs_df = pd.DataFrame(profs)
profs_df = profs_df.drop_duplicates()
profs_df.head()

| | name | position | email |
|---|---|---|---|
| 0 | Iyad A | Assoc. prof | Al***@tsi.lv |
| 1 | Jelena B | Lecturer | Ba***@tsi.lv |
| 2 | Dmitrij B | Assoc. prof | Bo***@tsi.lv |
| 3 | Evelina B | Assistant professor | Bu***@tsi.lv |
| 4 | Aleksandr B | Assistant professor | Bu***@tsi.lv |


## Example 2: TSI academic staff

In [12]:
# Exclude guest positions ('uest' matches both "Guest" and "guest"); ~ negates a boolean mask
profs_df[~profs_df["position"].str.contains('uest')].groupby('position').count()

| position | name | email |
|---|---|---|
| Assistant professor | 17 | 17 |
| Assoc. prof | 10 | 10 |
| Emeritus professor | 1 | 1 |
| Lecturer | 14 | 14 |
| Professor | 11 | 11 |
| State emeritus scientist, professor | 1 | 1 |


In [13]:
profs_df.to_csv("tsi_profs.csv")

# Thank you