# Introduction Week 5

This guide covers the following topics for this weekend's classes:

1. Overview of Web Applications & Web Scraping
2. Regular Expression
3. BeautifulSoup
4. Selenium

Each class will be conducted as follows:

1. Explanation
2. Online Demo
3. Q&A
4. Exercises

## Tools - Python Libraries 

1. Requests
2. Regex
3. BeautifulSoup
4. Selenium


## Web Applications

**What are Web Applications?**

According to Wikipedia

*'In computing, a web application or web app is a client–server computer program that the client (including the user interface and client-side logic) runs in a web browser. Common web applications include webmail, online retail sales, online banking, and online auctions.'*
  
  
**Reference:** 

https://en.wikipedia.org/wiki/Web_application

https://developer.mozilla.org/en-US/docs/Learn


### Clients and servers

Computers connected to the web are called **clients** and **servers**. A simplified diagram of how they interact might look like this:

<p>
    



<img src="https://mdn.mozillademos.org/files/8973/Client-server.jpg">

<p>
<p>
    

* Clients are the typical web user's internet-connected devices (for example, your computer connected to your Wi-Fi, or your phone connected to your mobile network) and web-accessing software available on those devices (usually a web browser like Firefox or Chrome).

   
* Servers are computers that store webpages, sites, or apps. When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user's web browser.

<img src="./Images/InternetTransactions.png">
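
The request/response exchange described above can be tried on a single machine. The following is a minimal sketch using only the Python standard library; `PAGE`, `PageHandler`, and `fetch_from_local_server` are names made up for this illustration, and the "server" is just a thread on your own computer.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# The "webpage" stored on our toy server (illustrative content).
PAGE = b"<html><body><p>Hello from the server!</p></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with a copy of PAGE."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet


def fetch_from_local_server():
    """Start the server, act as a client once, then shut the server down."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        # Client side: send a GET request and read the response.
        conn = HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        status, body = response.status, response.read()
        conn.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_from_local_server()
print(status)
print(body.decode("utf-8"))
```

The client receives a copy of the page stored on the server, which is the same pattern a browser follows when it loads a real website.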

### Anatomy of a Web Page

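A web page is typically built from three layers:

* **HTML** defines the structure and content of the page.
* **CSS** controls how the page looks (layout, colors, fonts).
* **JavaScript** adds behavior and interactivity.

A minimal HTML page looks like this: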

In [None]:
webpage = '''
<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
'''
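
To see the structure of such a page from Python, here is a small sketch that walks the same HTML with the standard library's `html.parser`. The `OutlineParser` class is a name invented for this example; BeautifulSoup, covered later, is the more convenient tool for real pages.

```python
from html.parser import HTMLParser

webpage = '''
<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
'''


class OutlineParser(HTMLParser):
    """Collect the opening tags and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace between tags
            self.texts.append(text)


parser = OutlineParser()
parser.feed(webpage)
print(parser.tags)   # ['html', 'head', 'title', 'body', 'p']
print(parser.texts)  # ['This is a title', 'Hello world!']
```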

## Web Scraping

**Web scraping** is a technique for extracting data from web pages in an **automated way**.

A web scraping **script** can load multiple pages and extract **data** from them.

Such a script consists of Python code plus the libraries it needs to perform the task.

The first library we need is **Requests**.

### Getting Started

**Install the Requests library**

pip3 install requests

OR

conda install -c conda-forge requests

Ref:

https://anaconda.org/conda-forge/requests


### Requests


**Requests** handles HTTP sessions and makes HTTP requests. The cell below imports it, requests a page, and checks the status code of the response.


In [None]:
import requests

url='https://www.thestar.com.my/news/nation/2020/03/23/covid-19-current-situation-in-malaysia-updated-daily'

page = requests.get(url)

page.status_code


### Status code

A status code of `200 OK` means the request succeeded. The full list of status codes is documented at:

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
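
As a rough aid for reading status codes, the sketch below uses the standard library's `http.HTTPStatus` to look up the reason phrase and groups codes by their first digit. The helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

# The first digit of a status code tells you its general class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not a registered HTTP status
    category = STATUS_CLASSES.get(code // 100, "unknown class")
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

In a scraping script you would typically pass `page.status_code` to a check like this and only parse the page when the code is in the success class.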


In [None]:
# print the returned page as string

page.text

In [None]:
# print the returned page as bytes

page.content

In [None]:
# show the encoding Requests uses to decode page.text

page.encoding

### Compare

Open the same URL in your web browser, view the page source, and compare it with the output of `page.text` above.


### Understanding the Web Page

Where are the following parts in the page source?

* HTML
* CSS
* JavaScript

Use the web browser's Developer Tools to find them.

### Reduce impact

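Do not request the same web page over and over; every request puts load on the server.

Many sites run anti-bot or anti-scraper measures and may block clients that send too many requests.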
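
One simple way to avoid hammering a site is to enforce a minimum pause between requests. The following is a minimal sketch, not a complete rate limiter; the `Throttle` class, the 0.1 second interval, and the example URLs are all assumptions made for illustration.

```python
import time


class Throttle:
    """Make sure consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


throttle = Throttle(min_interval=0.1)  # a real scraper would use seconds, not 0.1
for url in ("https://example.com/page1", "https://example.com/page2"):
    slept = throttle.wait()
    # A real script would call requests.get(url) here.
    print(f"ready to fetch {url} after sleeping {slept:.2f}s")
```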


In [7]:
# How to save HTML locally
import requests

def save_html(html, path):
    with open(path, 'wb') as f:
        f.write(html)
        
url = 'https://www.google.com'

r = requests.get(url)

save_html(r.content, 'google_com')

#print(r.content[:100])

In [8]:
# How to open and read HTML from a local file

def open_html(path):
    with open(path, 'rb') as f:
        return f.read()
    
    
html = open_html('google_com')
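
The two helpers above can be combined into a "fetch once, then reuse" pattern. This is a sketch under assumptions: `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real `requests.get(url).content` call so the example runs without a network connection.

```python
import tempfile
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the saved bytes at path, calling fetch() only if no copy exists yet."""
    path = Path(path)
    if path.exists():
        return path.read_bytes()
    html = fetch()
    path.write_bytes(html)
    return html


calls = []  # records how often the "network" was actually used


def fake_fetch():
    calls.append(1)
    return b"<html><body>cached page</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "page.html"
    first = load_or_fetch(cache_file, fake_fetch)   # fetches and saves
    second = load_or_fetch(cache_file, fake_fetch)  # reads the local copy

print(len(calls), first == second)
```

With Requests, the call could look like `load_or_fetch('google_com', lambda: requests.get(url).content)`, so repeated runs of the notebook hit the website only once.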


In [9]:
html

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-MY"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/logos/doodles/2021/wear-a-mask-save-lives-apr-6-6753651837109262.2-law.gif" itemprop="image"><meta content="Masks are still important. Wear a mask and save lives." property="twitter:title"><meta content="Masks are still important. Wear a mask and save lives. #GoogleDoodle" property="twitter:description"><meta content="Masks are still important. Wear a mask and save lives. #GoogleDoodle" property="og:description"><meta content="summary_large_image" property="twitter:card"><meta content="@GoogleDoodles" property="twitter:site"><meta content="https://www.google.com/logos/doodles/2021/wear-a-mask-save-lives-apr-6-6753651837109262-2xa.gif" property="twitter:image"><meta content="https://www.google.com/logos/doodles/2021/wear-a-mask-save-lives-apr-6-6753651837109262-2xa.gif" property="og:image"><meta content="1067" prop

In [10]:
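# for ipython notebook display
from IPython.display import display, HTML

# html is bytes: decode it first, because str(html) would render the b'...' repr
display(HTML(html.decode('utf-8', errors='replace')))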



### How to be a good scraper/bot

Look for `robots.txt` at the root of the domain.

In this file, the website owner explicitly states what bots are allowed to do on the site.
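
Python's standard library can read these rules for you. The sketch below feeds a short, hypothetical `robots.txt` excerpt (not a live copy of any site) to `urllib.robotparser` and asks whether a few example URLs may be fetched; `build_parser` and the `example.com` URLs are names chosen for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt in robots.txt format, kept short on purpose.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /imgres
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for url in (
    "https://www.example.com/search?q=python",
    "https://www.example.com/groups",
    "https://www.example.com/about",
):
    print(url, parser.can_fetch("*", url))
```

Treat the answers as a guide only: this parser applies the first matching rule in file order and does not expand `*` or `$` wildcards, so on a long file that mixes `Allow` and `Disallow` lines its result may differ from how the site's owner interprets the rules.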



In [11]:
import requests

url = 'https://www.google.com/robots.txt'
 
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'

headers={'User-Agent':user_agent}

r = requests.get(url, headers=headers)

print(r.content)

b'User-agent: *\nDisallow: /search\nAllow: /search/about\nAllow: /search/static\nAllow: /search/howsearchworks\nDisallow: /sdch\nDisallow: /groups\nDisallow: /index.html?\nDisallow: /?\nAllow: /?hl=\nDisallow: /?hl=*&\nAllow: /?hl=*&gws_rd=ssl$\nDisallow: /?hl=*&*&gws_rd=ssl\nAllow: /?gws_rd=ssl$\nAllow: /?pt1=true$\nDisallow: /imgres\nDisallow: /u/\nDisallow: /preferences\nDisallow: /setprefs\nDisallow: /default\nDisallow: /m?\nDisallow: /m/\nAllow:    /m/finance\nDisallow: /wml?\nDisallow: /wml/?\nDisallow: /wml/search?\nDisallow: /xhtml?\nDisallow: /xhtml/?\nDisallow: /xhtml/search?\nDisallow: /xml?\nDisallow: /imode?\nDisallow: /imode/?\nDisallow: /imode/search?\nDisallow: /jsky?\nDisallow: /jsky/?\nDisallow: /jsky/search?\nDisallow: /pda?\nDisallow: /pda/?\nDisallow: /pda/search?\nDisallow: /sprint_xhtml\nDisallow: /sprint_wml\nDisallow: /pqa\nDisallow: /palm\nDisallow: /gwt/\nDisallow: /purchases\nDisallow: /local?\nDisallow: /local_url\nDisallow: /shihui?\nDisallow: /shihui/\nDi

In [12]:
# Convert bytes to string
str(r.content,'utf-8')

'User-agent: *\nDisallow: /search\nAllow: /search/about\nAllow: /search/static\nAllow: /search/howsearchworks\nDisallow: /sdch\nDisallow: /groups\nDisallow: /index.html?\nDisallow: /?\nAllow: /?hl=\nDisallow: /?hl=*&\nAllow: /?hl=*&gws_rd=ssl$\nDisallow: /?hl=*&*&gws_rd=ssl\nAllow: /?gws_rd=ssl$\nAllow: /?pt1=true$\nDisallow: /imgres\nDisallow: /u/\nDisallow: /preferences\nDisallow: /setprefs\nDisallow: /default\nDisallow: /m?\nDisallow: /m/\nAllow:    /m/finance\nDisallow: /wml?\nDisallow: /wml/?\nDisallow: /wml/search?\nDisallow: /xhtml?\nDisallow: /xhtml/?\nDisallow: /xhtml/search?\nDisallow: /xml?\nDisallow: /imode?\nDisallow: /imode/?\nDisallow: /imode/search?\nDisallow: /jsky?\nDisallow: /jsky/?\nDisallow: /jsky/search?\nDisallow: /pda?\nDisallow: /pda/?\nDisallow: /pda/search?\nDisallow: /sprint_xhtml\nDisallow: /sprint_wml\nDisallow: /pqa\nDisallow: /palm\nDisallow: /gwt/\nDisallow: /purchases\nDisallow: /local?\nDisallow: /local_url\nDisallow: /shihui?\nDisallow: /shihui/\nDis

In [13]:
# print() shows each \n as a real line break

print(str(r.content,'utf-8'))


User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
Disallow: /u/
Disallow: /preferences
Disallow: /setprefs
Disallow: /default
Disallow: /m?
Disallow: /m/
Allow:    /m/finance
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /local?
Disallow: /local_url
Disallow: /shihui?
Disallow: /shihui/
Disallow: /products?
Disallow: /product_
Disallow: /p

In [14]:
import requests

url = 'https://www.google.com/robots.txt'
 
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'

headers={'User-Agent':user_agent}

r = requests.get(url, headers=headers)

print(r.text)

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
Disallow: /u/
Disallow: /preferences
Disallow: /setprefs
Disallow: /default
Disallow: /m?
Disallow: /m/
Allow:    /m/finance
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /local?
Disallow: /local_url
Disallow: /shihui?
Disallow: /shihui/
Disallow: /products?
Disallow: /product_
Disallow: /p

**Using r.text**

Requests makes educated guesses about the encoding of the response based on the HTTP headers.

The text encoding guessed by Requests is used when you access r.text. 

You can find out what encoding Requests is using, and change it, using the r.encoding property:
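
Why the encoding matters can be illustrated offline with plain Python bytes. In this sketch, `raw` plays the role of `r.content`, and decoding it with two different encodings plays the role of reading `r.text` under two different `r.encoding` values; the sample word is arbitrary.

```python
# Bytes as they might arrive from a server that sends UTF-8 text.
raw = "café".encode("utf-8")

# Decoding with the right encoding recovers the original text.
as_utf8 = raw.decode("utf-8")

# Decoding with the wrong encoding gives mojibake instead of an error.
as_latin1 = raw.decode("iso-8859-1")

print(raw)
print(as_utf8)
print(as_latin1)
```

If `r.text` shows garbled characters like the last line, the guessed encoding is probably wrong; setting `r.encoding = 'utf-8'` before reading `r.text` again is a common fix when the page is actually UTF-8.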
