# W5-01 : Requests, Regex, BeautifulSoup, Selenium

## Exercise(s): Learn how to perform Data Scraping


Objective: Learn Regex, BeautifulSoup, Selenium

Competencies:
-	Participants will be able to scrape web pages with Requests, Regex, BeautifulSoup, and Selenium

Tools: Python, Anaconda, Jupyter

Analysis case study: -

## Day 1: Basics of Web Scraping

1. Requests
2. Regex
3. BeautifulSoup
4. Selenium

## Day 2: Social Media Scraping

1. Facebook
2. Twitter
3. Instagram
4. Reddit


### Anatomy of a Web Page

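HTML (HyperText Markup Language): the structure and content of the page

CSS (Cascading Style Sheets): the visual presentation

JavaScript: the interactive behaviour that runs in the browser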


Show some examples 

In [None]:
webpage = '''
<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
'''
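
To see that structure programmatically, here is a minimal sketch that walks the same `webpage` string with Python's built-in `html.parser` module and records the text found inside each tag. The class name `TagTextCollector` is made up for this illustration; BeautifulSoup, covered later, does this far more conveniently.

```python
from html.parser import HTMLParser

webpage = '''
<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
'''

class TagTextCollector(HTMLParser):
    """Hypothetical helper: records non-blank text with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.found = []       # (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.found.append((self.open_tags[-1], text))

collector = TagTextCollector()
collector.feed(webpage)
print(collector.found)
```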

## Web Scraping

**Web scraping** is a technique for extracting data from web pages in an **automated way**.

A web scraping **script** can load multiple pages and extract the **data** from each of them.

Such a script consists of Python code plus the libraries it needs to perform the task.

The first library we need is **Requests**.


### Getting Started

**Install the Requests library**

pip3 install requests

OR

conda install -c conda-forge requests

### Chapter 1: Requests


**Requests** makes HTTP requests and manages HTTP sessions.

import requests


In [None]:
import requests

url='https://www.thestar.com.my/news/nation/2020/03/23/covid-19-current-situation-in-malaysia-updated-daily'

page = requests.get(url)

page.status_code


### Status code

200 OK

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
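
As a rough illustration of how status codes are grouped, the sketch below uses the standard library's `http.HTTPStatus` to look up the phrase for a code and labels it by its first digit. The helper name `describe_status` and the category labels are assumptions made for this example, not part of Requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return 'code phrase (category)' for an HTTP status code."""
    categories = {1: 'informational', 2: 'success', 3: 'redirection',
                  4: 'client error', 5: 'server error'}
    status = HTTPStatus(code)   # raises ValueError for unknown codes
    return f'{status.value} {status.phrase} ({categories[code // 100]})'

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In a scraper you would typically compare `page.status_code` against 200 before parsing the page.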


In [None]:
# print the returned page as string
page.text

In [None]:
# print the returned page as raw bytes
# bytes can be decoded with whichever encoding you choose, or written straight to a file
page.content

In [None]:
page.encoding

### Compare

Open the same URL in your web browser, view the page source, and compare it with the output above.


### Understanding the Web Page

Where are the following parts of the page?

HTML

CSS

Javascript

Show the Web Browser Developer Tools

### Reduce impact

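Do not request the same page over and over; download it once and work from a local copy.

Many sites run anti-bot / anti-scraper measures and may block clients that send too many requests.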
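
One possible way to avoid repeated requests is a small file cache. The sketch below is only an illustration: `cached_fetch` and `fake_fetch` are hypothetical names, and the download step is passed in as a function so the example runs without a network connection.

```python
import os
import tempfile

def cached_fetch(url, cache_path, fetch):
    """Return page bytes, calling fetch(url) only when no local copy exists."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return f.read()
    html = fetch(url)
    with open(cache_path, 'wb') as f:
        f.write(html)
    return html

# Stand-in for a real download so the sketch runs offline.
calls = []
def fake_fetch(url):
    calls.append(url)
    return b'<html><body>cached page</body></html>'

cache_path = os.path.join(tempfile.mkdtemp(), 'page.html')
first = cached_fetch('https://example.com', cache_path, fake_fetch)
second = cached_fetch('https://example.com', cache_path, fake_fetch)
print(len(calls))  # number of times the "site" was actually contacted
```

With Requests, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).content`.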


In [None]:
# How to save HTML locally
import requests

def save_html(html, path):
    with open(path, 'wb') as f:
        f.write(html)
        
url = 'https://www.google.com'

r = requests.get(url)

save_html(r.content, 'google_com')

# print(r.content[:300])

In [None]:
# How to open and read HTML from a local file

def open_html(path):
    with open(path, 'rb') as f:
        return f.read()
    
    
html = open_html('google_com')


In [None]:
html

### How to be a good scraper/bot

Look for robots.txt at the root of the domain.

In this file, the website owner explicitly states what bots are and are not allowed to do on the site.
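
The rules in a robots.txt file can also be checked from code. Below is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt content and the `example.com` URLs are made up for illustration, and the rules are parsed from a string so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_home = parser.can_fetch('*', 'https://example.com/index.html')
allowed_private = parser.can_fetch('*', 'https://example.com/private/data.html')
print(allowed_home, allowed_private)
```

For a live site, `set_url()` followed by `read()` downloads and parses the real file instead.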



In [None]:
import requests

url = 'https://www.google.com/robots.txt'
 
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'

headers={'User-Agent':user_agent}

r = requests.get(url, headers=headers)

print(r.content)


# User Agent String: This user agent string is then passed in the headers of the HTTP request.

# Headers: The headers dictionary contains the user agent information under the 'User-Agent' key. This is included in the HTTP request to 
# identify the client to the server.

# HTTP GET Request: requests.get(url, headers=headers) sends an HTTP GET request to the specified URL with the provided headers.

In [None]:
# Convert bytes to string
str(r.content,'utf-8')

In [None]:
# Use print() so the newlines are rendered as line breaks
print(str(r.content,'utf-8'))

In [None]:
import requests

url = 'https://www.google.com/robots.txt'
 
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'

headers={'User-Agent':user_agent}

r = requests.get(url, headers=headers)

print(r.text)

**Using r.text**

Requests makes educated guesses about the encoding of the response based on the HTTP headers.

The text encoding guessed by Requests is used when you access r.text. 

You can find out what encoding Requests is using, and change it, using the r.encoding property
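
Why the encoding matters can be shown without any network access. In this small sketch the bytes in `raw` stand in for `r.content` (the sample text is made up); decoding them with the right and the wrong encoding mimics what `r.text` would give for a correct and an incorrect `r.encoding`.

```python
raw = 'café RM5'.encode('utf-8')   # stand-in for r.content

as_utf8 = raw.decode('utf-8')      # correct guess
as_latin1 = raw.decode('latin-1')  # wrong guess: no error, but garbled text

print(raw)
print(as_utf8)
print(as_latin1)
```

If `r.text` looks garbled in the same way, setting `r.encoding = 'utf-8'` before reading `r.text` is usually the first thing to try.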

# Chapter 2: Regular Expressions (Regex)

**References:** 


https://docs.python.org/3/library/re.html

Third-party `regex` module (an alternative implementation):
https://pypi.org/project/regex/


Before starting this lesson, try to write code that extracts only the digits from this string.
str = "abc**00123**xyz**456**_**0**"

answer = ['00123', '456', '0']

## 1. What is a Regular Expression?


A regular expression in a programming language is a special text string used for describing a search pattern. 

It is extremely useful for extracting information from text such as code, files, logs, spreadsheets, or even documents.

<pre>
\    Drops the special meaning of the character that follows it

^    Matches the beginning of the string

$    Matches the end of the string

.    Matches any character except a newline

?    Matches zero or one occurrence

|    Means OR (matches any one of the alternatives it separates)

*    Matches any number of occurrences (including zero)

+    Matches one or more occurrences

{}   Indicates the number of occurrences of the preceding RE to match

()   Encloses a group of REs

</pre>
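
Each symbol in the table can be tried on a short string. The patterns and sample strings below are made up purely to illustrate one metacharacter at a time; `re.finditer` is used so that the full matched text is always shown, even when the pattern contains a group.

```python
import re

examples = [
    (r'^abc',    'abcabc'),         # ^ anchors at the beginning
    (r'abc$',    'abcabc'),         # $ anchors at the end
    (r'a.c',     'abc a-c ac'),     # . is any single character except newline
    (r'colou?r', 'color colour'),   # ? makes the preceding "u" optional
    (r'cat|dog', 'cat dog bird'),   # | matches either alternative
    (r'ab*',     'a ab abbb'),      # * zero or more "b"
    (r'ab+',     'a ab abbb'),      # + one or more "b"
    (r'\d{2,3}', '1 22 333 4444'),  # {m,n} between m and n repetitions
    (r'(ab)+',   'ababab'),         # () groups "ab" so + repeats the pair
    (r'1\+1',    '1+1=2'),          # \ removes the special meaning of +
]

results = {}
for pattern, text in examples:
    results[pattern] = [m.group(0) for m in re.finditer(pattern, text)]
    print(pattern, '->', results[pattern])
```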

## Methods

re.match(), re.search() and re.findall() are functions in Python's re module.

### The re.match() method

The re.match() method finds a match only if it occurs at the start of the string. For example, calling match() on the string ‘FS Forward School FS’ with the pattern ‘FS’ succeeds because the string begins with ‘FS’, whereas the pattern ‘School’ would return None.



In [None]:
import re

data='FS Forward School FS'

result = re.match(r'FS', data)

print(result.group(0))

### The re.search() method

The re.search() method is similar to re.match(), but it is not limited to the beginning of the string: it scans the whole string and returns the first match it finds.

In [None]:
import re

result = re.search(r'School', 'FS Forward School FS')

print(result)
print(result.group(0))

### The re.findall() method

The re.findall() method returns a list of all non-overlapping matches of the pattern, scanning the string from left to right. Where re.match() and re.search() return at most one match object, re.findall() returns every occurrence as a string, and an empty list when nothing matches. Use re.findall() when you need all occurrences; use re.match() or re.search() when you only need to know whether, or where, the pattern first occurs.
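
One detail that often surprises newcomers is how `re.findall()` behaves once the pattern contains a capture group. The short sketch below reuses the 'FS Forward School FS' string; the patterns themselves are chosen only for illustration.

```python
import re

text = 'FS Forward School FS'

# No capture group: findall returns the matched substrings.
whole = re.findall(r'F\w+', text)

# One capture group: findall returns only what the group captured.
captured = re.findall(r'F(\w+)', text)

# finditer yields match objects, which also carry the positions.
positions = [(m.group(0), m.start()) for m in re.finditer(r'FS', text)]

print(whole)
print(captured)
print(positions)
```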


In [None]:
import re

result = re.findall(r'FS', 'FS Forward School FS')

print(result)


In [None]:
import re   # Need module 're' for regular expression

# Try find: re.findall(regexStr, inStr) -> matchedSubstringsList

# r'...' denotes raw strings which ignore escape code, i.e., r'\n' is '\'+'n'

re.findall(r'[0-9]+', 'abc123xyz')

# Return a list of matched substrings

In [None]:
re.findall(r'[0-9]+', 'abcxyz')
# where are the numbers?

In [None]:
import re

re.findall(r'[0-9]+', 'abc00123xyz456_0')


In [None]:
import re

re.findall(r'\d+', 'abc00123xyz456_0')

# \d represents any decimal digit; for ASCII text it is equivalent to the set [0-9]

In [None]:
# Try substitute: re.sub(regexStr, replacementStr, inStr) -> outStr
import re

re.sub(r'[0-9]+', r'0', 'abc00123xyz456_0')

In [None]:
# Try substitute with count: re.subn(regexStr, replacementStr, inStr) -> (outStr, count)
import re

re.subn(r'[0-9]+', r'*', 'abc00123xyz456_0')

# Return a tuple of output string and count
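
The replacement passed to `re.sub()` does not have to be a fixed string. As a small sketch on the same input, it can refer back to the matched text with `\g<0>`, or be a function that computes the replacement from each match.

```python
import re

text = 'abc00123xyz456_0'

# \g<0> refers to the whole match, so each number is wrapped in brackets.
wrapped = re.sub(r'[0-9]+', r'[\g<0>]', text)

# A function receives each match object and returns the replacement text.
no_leading_zeros = re.sub(r'[0-9]+', lambda m: str(int(m.group(0))), text)

print(wrapped)
print(no_leading_zeros)
```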

### Texts

In [None]:
# Match text that starts with "I" and ends with "Ronaldo", with only letters, digits, underscores, or spaces in between.
import re

s1 = "I love Ronaldo"

regex=r"(I[a-zA-Z_0-9 ]*Ronaldo)"

match = re.findall(regex, s1)
print(match)


In [None]:
# Match text that starts with "I" and ends with "Ronaldo", with only letters, digits, underscores, or spaces in between.
import re

s2 = "I hate Ronaldo" 

regex=r"(I[a-zA-Z_0-9 ]*Ronaldo)"

match = re.findall(regex, s2)
print(match)


In [None]:
# Match text that starts with "I" and ends with "Ronaldo", with only letters, digits, underscores, or spaces in between.
import re

s3 = "I love Ronaldo, Ronaldo, I Dunno him, I Hate Ronaldo" 

regex=r"(I[a-zA-Z_0-9 ]*Ronaldo)"

match = re.findall(regex, s3)
print(match)


In [None]:
# Match text that starts with "I" and ends with "Ronaldo", with only letters, digits, underscores, or spaces in between.
import re

s4 = "I heard that somebody said Ronaldo that he heard that A was informed that B thought that C misunderstood Ronaldo"

regex=r"(I[a-zA-Z_0-9 ]*Ronaldo)"

match = re.findall(regex, s4)
print(match)


In [None]:
# Match text that starts with "I" and ends with "Ronaldo", with only letters, digits, underscores, or spaces in between.
import re

s5 = "I heard that somebody said Ronaldo, that he heard that A was informed that B thought that C misunderstood Ronaldo"

regex=r"(I[a-zA-Z_0-9 ]*Ronaldo)"

match = re.findall(regex, s5)
print(match)

#observe the comma

In [None]:
import re

s = "Apple Orange Watermelon"

regex=r"(Forward School)"

match = re.findall(regex, s)

print(match)

In [None]:
import re

s = "Apple Orange Watermelon Forward School"

regex=r"(Forward School)"

match = re.findall(regex, s)

print(match)

In [None]:
import re

s = "Apple Orange Watermelon Forward School"

regex=r"(Forward School)"

match = re.match(regex, s)

print(match)

In [None]:
import re

s = "Apple Orange Watermelon Forward School"

regex=r"(Apple Orange Watermelon)"

match = re.match(regex, s)

print(match)
print(match.group(0))

## 1. Pattern for Matching 000..255

Write a regular expression to match three-digit numbers from 000 to 255. Use the pattern ([01][0-9][0-9]|2[0-4][0-9]|25[0-5]).

## 2. Pattern for Matching 0..255 with Optional Leading Zeros

How would you construct a regular expression that also accepts one- and two-digit forms such as 0, 7, or 42? Make the leading digits optional: ([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]).

In [None]:
import re

# The regular expression pattern
pattern = r'([01][0-9][0-9]|2[0-4][0-9]|25[0-5])'

# Example string to search
text = "Test numbers: 000, 123, 234, 255, 256, 299"

# Find all matches
matches = re.findall(pattern, text)

# Print the results
print("Matches:", matches)

In [None]:
#example

import re

s = "000..255..278"

regex=r"([01][0-9][0-9]|2[0-4][0-9]|25[0-5])"

match = re.findall(regex, s)

print(match)
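
When the goal is to validate a single value rather than to search inside a longer text, the pattern has to cover the whole string. The sketch below tries the second pattern (with optional leading digits) using `re.fullmatch()`; the helper name `is_octet` and the candidate list are made up for illustration.

```python
import re

octet = re.compile(r'[01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]')

def is_octet(s):
    """True when the whole string is a number from 0 to 255 (up to 3 digits)."""
    return octet.fullmatch(s) is not None

candidates = ['0', '7', '42', '007', '199', '255', '256', '299', '1234']
valid = [s for s in candidates if is_octet(s)]
print(valid)

# Without anchoring, findall takes "25" out of "256".
unanchored = octet.findall('256')
print(unanchored)
```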

In [None]:
# A Python program to demonstrate working of re.search().
import re  
    
# Lets use a regular expression to match a date string  
# in the form of Month name followed by day number  
regex = r"([a-zA-Z]+) (\d+)"
    
match = re.search(regex, "I was born on June 24")  
    
if match is not None:

    # We reach here when the expression "([a-zA-Z]+) (\d+)"
    # matches the date string.

    # This will print "Match at index 14, 21": the match starts
    # at index 14 and ends just before index 21.
    print("Match at index % s, % s" % (match.start(), match.end()))

    # We use the group() method to get the full match and the
    # captured groups. The groups contain the matched values.
    # In particular:  
    # match.group(0) always returns the fully matched string  
    # match.group(1) match.group(2), ... return the capture  
    # groups in order from left to right in the input string  
    # match.group() is equivalent to match.group(0)  
    
    # So this will print "June 24"  
    print("Full match: % s" % (match.group(0))) 
    
    # So this will print "June"  
    print("Month: % s" % (match.group(1))) 
    
    # So this will print "24"  
    print("Day: % s" % (match.group(2))) 
    
else:  
    print("The regex pattern does not match.") 
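
Numbered groups become hard to follow as patterns grow. As a variation on the same date example, the sketch below gives the groups names with the `(?P<name>...)` syntax; the names `month` and `day` are chosen here for illustration.

```python
import re

regex = r'(?P<month>[a-zA-Z]+) (?P<day>\d+)'
match = re.search(regex, 'I was born on June 24')

if match is not None:
    info = match.groupdict()   # all named groups as a dictionary
    print(info)
    print(match.group('month'), int(match.group('day')))
```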


### Extract Email using Regex

<img src='./Images/Intro2.png'>

In [None]:
import requests
import re

url='https://mmuexpert.mmu.edu.my/fci'
 
# get the data    
data = requests.get(url)


emails = re.findall(r'([\d\w\.\-]+@[\d\w\.\-]+\.\w+)', data.text)

print(emails)

In [None]:
# print(data.text)
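
The same idea can be tried offline. In the sketch below, `sample_html` is a made-up fragment standing in for `data.text`, and the pattern is a slightly shortened form of the one above (`\w` already includes digits, so `\d` is redundant). Pages often repeat an address in the link and in the visible text, so duplicates are removed afterwards.

```python
import re

# Made-up HTML fragment standing in for data.text.
sample_html = '''
<li>Dr. Alice <a href="mailto:alice@example.edu.my">alice@example.edu.my</a></li>
<li>Mr. Bob Tan, bob.tan@example.com.</li>
'''

found = re.findall(r'[\w.\-]+@[\w.\-]+\.\w+', sample_html)

# dict.fromkeys removes duplicates while keeping first-seen order.
unique_emails = list(dict.fromkeys(found))
print(unique_emails)
```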

### Extract Phone Number using Regex

In [None]:
import requests
import re

# get the data
data = requests.get(url)

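# Phone numbers: 2 digits, a dash, then 8 digits
phones = re.findall(r'[0-9]{2}\-[0-9]{8}', data.text)
# Obfuscated emails written as name[at]domain
emails = re.findall(r'([\d\w\.\-]+\[at\][\d\w\.\-]+\.\w+)', data.text)

print(phones)
print(emails)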

### Exercises

**Exercise 1**

Extract the Emails and Phone numbers from this page using RegEx

https://mmuexpert.mmu.edu.my/fci

In [None]:
import requests
import re

url='https://mmuexpert.mmu.edu.my/fci'
 
# get the data    
data = requests.get(url)


emails = re.findall(r'([\d\w\.\-]+@[\d\w\.\-]+\.\w+)', data.text)
phones = re.findall(r'[0-9]{2}\-[0-9]{8}', data.text)


print(emails)
print(phones)

In [None]:
print(phones)

### HOMEWORK: Question 1

- Create dummy data in a string containing Malaysian phone numbers (both landline and mobile).  
- Add at least one invalid number to prove that the pattern matching works.  
- Use pattern matching to find all the valid numbers, handling both landline and mobile formats.

### ANSWER for Question 1

In [31]:
import re
 
phone_data = "05-6886075 012-5069787 011-11234800 010-345987 03-88000088 03-123321"
 
# Landline: 2-digit area code, a dash, then 7 or 8 digits
landline_number = re.findall(r'(?<!\d)(?:[0-9]{2})-[0-9]{7,8}', phone_data)
# Mobile: 3-digit prefix starting with 01, a dash, then 7 or 8 digits
mobile_number = re.findall(r'(?<!\d)01[0-9]-[0-9]{7,8}', phone_data)

print(f'Malaysian Landline Numbers: {landline_number}')
print(f'Malaysian Mobile Numbers: {mobile_number}')

Malaysian Landline Numbers: ['05-6886075', '03-88000088']
Malaysian Mobile Numbers: ['012-5069787', '011-11234800']


### Discussion

Is regex good for scraping irregular text from web pages?

E.g. find all the text between <li> and </li>.

It can work, but how usable is it once the tags carry attributes or are nested?

E.g. find every name that starts with "Mr." and extract the name. How?

We need a better way to extract the elements of a web page and their metadata.
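
To make the discussion concrete, here is a rough sketch of both ideas on a made-up HTML fragment. It shows that a regex can do the job, and also how quickly it needs adjusting when the markup varies.

```python
import re

# Made-up fragment for illustration.
html = '''
<ul>
  <li>Mr. Ahmad attended</li>
  <li class="vip">Mr. Lee spoke</li>
  <li>Ms. Tan chaired</li>
</ul>
'''

# Works only for the exact tag "<li>"; the item with an attribute is missed.
strict = re.findall(r'<li>(.*?)</li>', html)

# Allowing attributes helps, but every new variation needs another tweak.
loose = re.findall(r'<li[^>]*>(.*?)</li>', html)

# Names that follow "Mr." (assumes a single capitalised word).
misters = re.findall(r'Mr\.\s+([A-Z][a-z]+)', html)

print(strict)
print(loose)
print(misters)
```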

### Raw strings: r'...'

https://docs.python.org/3/library/re.html

<pre>
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. Also, please note that any invalid escape sequences in Python’s usage of the backslash in string literals now generate a DeprecationWarning and in the future this will become a SyntaxError. This behaviour will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
</pre>
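
The points in the quoted passage can be checked directly. This small sketch compares the lengths of `'\n'` and `r'\n'`, then matches a literal backslash with both notations; the Windows-style `path` value is made up for illustration.

```python
import re

# One newline character vs. a backslash followed by "n".
lengths = (len('\n'), len(r'\n'))
print(lengths)

path = r'C:\temp\new'

# Two equivalent ways to write a pattern that matches one literal backslash.
with_escapes = re.findall('\\\\', path)
with_raw = re.findall(r'\\', path)

print(with_escapes == with_raw, len(with_raw))
```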


**Note:**

Alternative 3rd party regex implementation

https://pypi.org/project/regex/

Try it out on your own.


In [None]:
# Example

teststring = 'this is \n a test'

print(teststring)


In [None]:
# Example: with the r prefix, \n stays as a backslash followed by n instead of becoming a newline

teststring = r'this is \n a test'

print(teststring)


In [None]:
# \b Word boundary, allow to perform "whole words only" search

import re

re.findall('\btest\b', 'test this is a test') # Python turns '\b' into a backspace character before re sees it, so nothing matches


In [None]:
re.findall('\\btest\\b', 'test this is a test') # backslash is explicitly escaped and is passed through to re module


In [None]:
re.findall(r'\btest\b', 'test this is a test') # often this syntax is easier
