<h3>Python libraries for web scraping:</h3><br>1. Requests<br>2. Beautiful Soup<br>3. Selenium<br>4. Scrapy

<h3>Requests:</h3>As a preparation step to parsing, we can use Requests to download HTML and other files from the internet.

In [45]:
import requests

In [46]:
r = requests.get('https://api.github.com/events')
r.text



HTTP POST request:

In [47]:
r = requests.post('https://httpbin.org/post', data = {'key':'value'})
r.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "key": "value"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "9", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.21.0"\n  }, \n  "json": null, \n  "origin": "106.222.241.12, 106.222.241.12", \n  "url": "https://httpbin.org/post"\n}\n'

Other HTTP request types: PUT, DELETE, HEAD, OPTIONS

In [48]:
r = requests.put('https://httpbin.org/put', data = {'key':'value'})
r.text


'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "key": "value"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "9", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.21.0"\n  }, \n  "json": null, \n  "origin": "106.222.241.12, 106.222.241.12", \n  "url": "https://httpbin.org/put"\n}\n'

In [49]:
r = requests.delete('https://httpbin.org/delete')
r.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.21.0"\n  }, \n  "json": null, \n  "origin": "106.222.241.12, 106.222.241.12", \n  "url": "https://httpbin.org/delete"\n}\n'

In [50]:
r = requests.head('https://httpbin.org/get')
r.text

''

In [51]:
r = requests.options('https://httpbin.org/get')
r.text

''

<h5>Passing Parameters In URLs</h5><br>We often want to send some sort of data in the URL’s query string. If we are constructing the URL by hand, this data would be given as key/value pairs in the URL after a question mark, e.g. httpbin.org/get?key=val. Requests allows us to provide these arguments as a dictionary of strings, using the params keyword argument.

In [52]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)
print(r.url)

https://httpbin.org/get?key1=value1&key2=value2


Note that any dictionary key whose value is None will not be added to the URL's query string. We can also pass a list of items as a value, in which case the key is repeated once per item:

In [53]:
payload = {'key1': 'value1', 'key2': ['value2','value3'],'key3': None}
r = requests.get('https://httpbin.org/get', params=payload)
print(r.url)

https://httpbin.org/get?key1=value1&key2=value2&key2=value3
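To check the URL that Requests would build without sending anything over the network, one option is to prepare the request locally. This is only a sketch: it reuses the httpbin.org endpoint from the cell above, and the variable name `prepared` is ours.

```python
import requests

# Same payload as above: a list value repeats the key, a None value is dropped.
payload = {"key1": "value1", "key2": ["value2", "value3"], "key3": None}

# Request(...).prepare() builds the final URL locally; no HTTP call is made.
prepared = requests.Request("GET", "https://httpbin.org/get", params=payload).prepare()
print(prepared.url)
```

This can be handy when debugging query strings, since the result does not depend on the server being reachable.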


We can find out what encoding Requests is using, and change it, using the r.encoding property:

In [54]:
print(r.encoding)
r.encoding = 'ISO-8859-1'
print(r.encoding)

None
ISO-8859-1
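Why the encoding matters can be illustrated with plain byte strings, independent of Requests: r.text is essentially r.content decoded with r.encoding, so a wrong encoding produces garbled text. A small sketch using only the standard library; the sample string is our own:

```python
# Bytes as a server might send them, encoded as UTF-8.
raw = "Sacré bleu!".encode("utf-8")

# Decoding with the right codec recovers the text.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 turns the two-byte "é" into two characters.
as_latin1 = raw.decode("ISO-8859-1")

print(as_utf8)
print(as_latin1)
```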


<h5>Binary Response Content</h5>

In [55]:
r.content

b'{\n  "args": {\n    "key1": "value1", \n    "key2": [\n      "value2", \n      "value3"\n    ]\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.21.0"\n  }, \n  "origin": "106.222.241.12, 106.222.241.12", \n  "url": "https://httpbin.org/get?key1=value1&key2=value2&key2=value3"\n}\n'

<h5>JSON Response Content</h5>

In [56]:
r = requests.get('https://api.github.com/events')
r.json()

[{'id': '9878243127',
  'type': 'CreateEvent',
  'actor': {'id': 27856297,
   'login': 'dependabot-preview[bot]',
   'display_login': 'dependabot-preview',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/dependabot-preview[bot]',
   'avatar_url': 'https://avatars.githubusercontent.com/u/27856297?'},
  'repo': {'id': 183613479,
   'name': 'emittr/emittr',
   'url': 'https://api.github.com/repos/emittr/emittr'},
  'payload': {'ref': 'dependabot/npm_and_yarn/tslint-5.18.0',
   'ref_type': 'branch',
   'master_branch': 'master',
   'description': 'A framework agnostic way of building Event Dispatchers for TypeScript.',
   'pusher_type': 'user'},
  'public': True,
  'created_at': '2019-06-24T07:03:09Z',
  'org': {'id': 51016026,
   'login': 'emittr',
   'gravatar_id': '',
   'url': 'https://api.github.com/orgs/emittr',
   'avatar_url': 'https://avatars.githubusercontent.com/u/51016026?'}},
 {'id': '9878243124',
  'type': 'CreateEvent',
  'actor': {'id': 23121826,
   'login': '

It should be noted that the success of the call to r.json() does not indicate the success of the response. Some servers may return a JSON object in a failed response (e.g. error details with HTTP 500). Such JSON will be decoded and returned. To check that a request is successful, use r.raise_for_status() or check r.status_code is what you expect.

In [57]:
r.status_code

200
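One way to combine the two checks is a small helper that refuses to decode JSON from a failed response. The helper name `json_if_ok` is hypothetical and not part of Requests; it is a sketch of the pattern rather than a recommended API.

```python
import requests


def json_if_ok(response):
    """Return the decoded JSON body only when the status code indicates success."""
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    # and does nothing otherwise.
    response.raise_for_status()
    return response.json()
```

With the response from the cell above this would be called as `json_if_ok(r)`; an error response raises requests.HTTPError instead of silently returning the error payload.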

<h5>Raw Response Content</h5>

In [58]:
r = requests.get('https://api.github.com/events', stream=True)
r.raw
r.raw.read(10)

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
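The bytes printed above begin with 1f 8b, the magic number of the gzip format: r.raw exposes the body exactly as it travelled over the wire, still compressed. The standard-library sketch below reproduces that effect on a made-up payload, without any network access:

```python
import gzip

# A hypothetical JSON body, compressed the way a server might send it.
body = b'{"key": "value"}'
wire_bytes = gzip.compress(body)

# The first bytes are the gzip magic number (1f 8b) and the deflate method (08).
print(wire_bytes[:3])

# Decompressing restores the original body.
restored = gzip.decompress(wire_bytes)
print(restored)
```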

Response.iter_content handles much of what we would otherwise have to handle ourselves when using Response.raw directly. When streaming a download, iterating over Response.iter_content is the preferred and recommended way to retrieve the content.<strong> Response.iter_content will automatically decode the gzip and deflate transfer-encodings. Response.raw is a raw stream of bytes; it does not transform the response content. If you really need access to the bytes as they were returned, use Response.raw.</strong>

In [59]:
# with open(filename, 'wb') as fd:
#     for chunk in r.iter_content(chunk_size=128):
#         fd.write(chunk)
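The commented-out loop can be wrapped into a reusable function. The sketch below makes a few assumptions of our own: the function name `download_file`, the 8192-byte chunk size, and the 30-second timeout are illustrative choices, not values from the Requests documentation.

```python
import requests


def download_file(url, path, chunk_size=8192):
    """Stream the body of `url` into the file at `path` without loading it all into memory."""
    # stream=True defers downloading the body until we iterate over it.
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(path, "wb") as fd:
            # iter_content yields decoded chunks of at most chunk_size bytes.
            for chunk in response.iter_content(chunk_size=chunk_size):
                fd.write(chunk)
    return path
```

Using the response as a context manager releases the connection even if writing the file fails partway through.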

<h5>Custom Headers</h5>

In [60]:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers,stream = True)
r.raw

<urllib3.response.HTTPResponse at 0x1ca57671c88>

<h5>More complicated POST requests</h5>
Typically, we want to send some form-encoded data — much like an HTML form. To do this, simply pass a dictionary to the data argument. Your dictionary of data will automatically be form-encoded when the request is made:

In [61]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("https://httpbin.org/post", data=payload)
print(r.text)
print(r.url)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "json": null, 
  "origin": "106.222.241.12, 106.222.241.12", 
  "url": "https://httpbin.org/post"
}

https://httpbin.org/post


The data argument can also have multiple values for each key. This can be done by making data either a list of tuples or a dictionary with lists as values. This is particularly useful when the form has multiple elements that use the same key:

In [68]:
payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
r1 = requests.post('https://httpbin.org/post', data=payload_tuples)
payload_dict = {'key1': ['value1', 'value2']}
r2 = requests.post('https://httpbin.org/post', data=payload_dict)
print(r2.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": [
      "value1", 
      "value2"
    ]
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "json": null, 
  "origin": "106.222.139.194, 106.222.139.194", 
  "url": "https://httpbin.org/post"
}
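Both payload shapes should produce the same form-encoded body. This can be checked offline by preparing the requests instead of sending them; the variable names below are ours, and the sketch only inspects what Requests would send.

```python
import requests

url = "https://httpbin.org/post"

# A list of tuples and a dict with a list value describe the same repeated key.
from_tuples = requests.Request("POST", url, data=[("key1", "value1"), ("key1", "value2")]).prepare()
from_dict = requests.Request("POST", url, data={"key1": ["value1", "value2"]}).prepare()

print(from_tuples.body)
print(from_dict.body)
print(from_tuples.headers["Content-Type"])
```

The body is 23 characters long, which matches the Content-Length echoed back by httpbin in the output above.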



<h5>POST a Multipart-Encoded File</h5>To upload a file, pass a dictionary of file objects opened in binary mode to the files argument:

In [69]:
# url = 'https://httpbin.org/post'
# files = {'file': open('report.xls', 'rb')}
# r = requests.post(url, files=files)
# r.text

We can set the filename, content type, and headers explicitly:

In [70]:
# url = 'https://httpbin.org/post'
# files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})}
# r = requests.post(url, files=files)
# r.text

We can also send strings to be received as files:


In [71]:
url = 'https://httpbin.org/post'
files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
r = requests.post(url, files=files)
r.text

'{\n  "args": {}, \n  "data": "", \n  "files": {\n    "file": "some,data,to,send\\nanother,row,to,send\\n"\n  }, \n  "form": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "184", \n    "Content-Type": "multipart/form-data; boundary=69f245642dac47c1e1572dcc861f6406", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.21.0"\n  }, \n  "json": null, \n  "origin": "106.222.202.242, 106.222.202.242", \n  "url": "https://httpbin.org/post"\n}\n'

<h5>Response Headers</h5>

In [72]:
r.headers

{'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Mon, 24 Jun 2019 07:15:39 GMT', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Server': 'nginx', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '306', 'Connection': 'keep-alive'}

HTTP header names are case-insensitive, so we can access the headers using any capitalization we want:

In [73]:
r.headers['Content-Type']

'application/json'

In [74]:
r.headers.get('content-type')

'application/json'
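The case-insensitive lookup comes from the dictionary type Requests uses for headers, requests.structures.CaseInsensitiveDict. A short offline sketch with a hand-built header dictionary (the header values are copied from the output above):

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({"Content-Type": "application/json", "Server": "nginx"})

# Lookups ignore case.
print(headers["content-type"])
print(headers.get("CONTENT-TYPE"))

# Iteration still reports the capitalization that was originally stored.
print(list(headers))
```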

<h5>Cookies: </h5>If a response contains some Cookies, you can quickly access them:

In [75]:
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies

<RequestsCookieJar[]>

To send your own cookies to the server, you can use the cookies parameter:

In [76]:
url = 'https://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
r.text

'{\n  "cookies": {\n    "cookies_are": "working"\n  }\n}\n'

Cookies are returned in a RequestsCookieJar, which acts like a dict but also offers a more complete interface, suitable for use over multiple domains or paths. Cookie jars can also be passed in to requests:

In [77]:
jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
url = 'https://httpbin.org/cookies'
r = requests.get(url, cookies=jar)
r.text

'{\n  "cookies": {\n    "tasty_cookie": "yum"\n  }\n}\n'
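Only tasty_cookie was sent because its path matches the requested URL. The same filtering can be seen directly on the jar, without a request, using its get_dict and get methods. A minimal sketch that rebuilds the jar from the cell above:

```python
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
jar.set("tasty_cookie", "yum", domain="httpbin.org", path="/cookies")
jar.set("gross_cookie", "blech", domain="httpbin.org", path="/elsewhere")

# get_dict can filter by domain and path and returns a plain dict.
for_cookies_path = jar.get_dict(domain="httpbin.org", path="/cookies")
print(for_cookies_path)

# Without filters, every cookie in the jar is included.
print(jar.get_dict())

# A single cookie can be looked up by name, optionally with domain and path.
print(jar.get("gross_cookie", domain="httpbin.org", path="/elsewhere"))
```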

In [78]:
r.history

[]

<h3>Beautiful Soup</h3>Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with our favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.<br><br> The following HTML document is used as an example throughout this section.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [92]:
from bs4 import BeautifulSoup
# Read the sample document from disk; the with-block closes the file afterwards.
with open('html_doc.txt', 'r') as f:
    html_report_part1 = f.read()
soup = BeautifulSoup(html_report_part1, "html.parser")
print(soup.prettify())


html_doc = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
  """
 </body>
</html>


In [94]:
soup.p


<p class="title"><b>The Dormouse's story</b></p>

In [95]:
soup.a


<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [96]:
soup.find_all('a')


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [97]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
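In a scraper we usually want the link text together with its target. The sketch below is self-contained: it parses a shortened copy of the sample document from a string, uses the built-in html.parser, and collects (text, href) pairs. The variable names are our own.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a", class_="sister")]
print(links)

# find() returns the first match; here we look an element up by its id.
second = soup.find(id="link2")
print(second["href"])
```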


In [98]:
print(soup.get_text())

html_doc = """
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""


<h4>Making the soup</h4>

In [101]:
from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp)

soup = BeautifulSoup("<html>data</html>")
soup

<html><body><p>data</p></body></html>

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

In [103]:
BeautifulSoup("Sacr&eacute; bleu!")

<html><body><p>Sacré bleu!</p></body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser.

<h3>Kinds of objects</h3>Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects:<br>1. Tag<br>2. NavigableString<br>3. BeautifulSoup<br>4. Comment

<h4>Tag</h4>A Tag object corresponds to an XML or HTML tag in the original document:

In [104]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)

bs4.element.Tag

 The most important features of a tag are its <strong>name</strong> and <strong>attributes</strong>.

<h4>Name</h4> Every tag has a name, accessible as .name:



In [105]:
tag.name

'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [107]:
tag.name = "blockquote"
tag

<blockquote class="boldest">Extremely bold</blockquote>

<h4>Attributes</h4>A tag may have any number of attributes. The tag &lt;b class="boldest"&gt; has an attribute “class” whose value is “boldest”. The full set of attributes is available as .attrs:

In [111]:
tag.attrs

{'class': ['boldest']}

We can add, remove, and modify a tag’s attributes. This is done by treating the tag as a dictionary:

In [112]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag

<blockquote another-attribute="1" class="boldest" id="verybold">Extremely bold</blockquote>

In [113]:
del tag['id']
del tag['another-attribute']
tag

<blockquote class="boldest">Extremely bold</blockquote>

<h4>Multi-valued attributes</h4>HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

In [115]:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']

['body']

When you turn a tag back into a string, multiple attribute values are consolidated:

In [117]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
print(rel_soup.a['rel'])
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

['index']
<p>Back to the <a rel="index contents">homepage</a></p>
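The difference between multi-valued and single-valued attributes shows up when a value contains a space. A short sketch with made-up markup, using the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

# class is multi-valued, so the value is split on whitespace into a list.
print(soup.p["class"])

# id is not multi-valued, so the value stays a single string, space included.
print(soup.p["id"])
```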


<h4>Navigable string</h4>A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

In [119]:
tag.string

'Extremely bold'

In [120]:
type(tag.string)

bs4.element.NavigableString

We can’t edit a string in place, but we can replace one string with another, using replace_with():

In [126]:
tag.string.replace_with("No longer bold")
tag

<blockquote class="boldest">No longer bold</blockquote>

<b>If we want to use a NavigableString outside of Beautiful Soup, we should call str() on it to turn it into a normal Python string. If we don’t, our string will carry around a reference to the entire Beautiful Soup parse tree, even when we’re done using Beautiful Soup. This is a big waste of memory.</b>
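A brief sketch of that conversion, using the same markup as the Tag example and the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# The NavigableString still knows where it lives in the parse tree.
nav_string = soup.b.string
print(type(nav_string).__name__, nav_string.parent.name)

# str() returns a plain Python string with no link back to the tree.
plain = str(nav_string)
print(type(plain).__name__, plain)
```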

<h4>BeautifulSoup</h4>The BeautifulSoup object represents the parsed document as a whole. Since it doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

In [127]:
soup.name

'[document]'

<h4>Comments and other special strings</h4>

In [133]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
print(soup)
comment = soup.b.string
print(comment)
type(comment)
print(soup.b.prettify())

<html><body><b><!--Hey, buddy. Want to buy a used parser?--></b></body></html>
Hey, buddy. Want to buy a used parser?
<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>


In [134]:
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

<b>
 <![CDATA[A CDATA block]]>
</b>
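When scraping, comments are often noise that we want to find and strip out. Because Comment is a subclass of NavigableString, one common approach is to filter strings by type. The markup below is invented for the example:

```python
from bs4 import BeautifulSoup, Comment

markup = "<div><!-- build 42 --><p>Visible</p><!-- end --></div>"
soup = BeautifulSoup(markup, "html.parser")

# Select only the strings that are Comment instances.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c).strip() for c in comments]
print(comment_texts)

# extract() removes each comment from the tree.
for c in comments:
    c.extract()
print(soup)
```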


<h3>Selenium</h3>Unlike Requests/Beautiful Soup, Selenium opens a visible browser window when we run the code. It can be used to simulate mouse clicks and key presses, as well as select elements of the page. One of the main use cases for this library is testing a website during development.

<b>The selenium.webdriver module provides all the WebDriver implementations. Currently supported WebDriver implementations are Firefox, Chrome, IE and Remote. The Keys class provides the keys on the keyboard, such as RETURN, F1, and ALT.</b>

In [60]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

Next, an instance of the Chrome WebDriver is created.

In [61]:
driver = webdriver.Chrome()

The driver.get method navigates to the page at the given URL. WebDriver waits until the page has fully loaded before returning control to the script.

In [62]:
driver.get("http://www.python.org")

The next line is an assertion confirming that the page title contains the word “Python”:

In [63]:
assert "Python" in driver.title

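WebDriver offers a number of ways to find elements. Older Selenium releases exposed them as find_element_by_* methods; Selenium 4 removed these in favor of find_element() with a By locator:

In [64]:
from selenium.webdriver.common.by import By
elem = driver.find_element(By.NAME, "q")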

Next, we send keys to the element, which is similar to typing on the keyboard. Special keys can be sent using the Keys class imported from selenium.webdriver.common.keys.

In [65]:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

After submission of the page, you should get the result if there is any. To ensure that some results are found, make an assertion:

In [66]:
assert "No results found." not in driver.page_source

Finally, the browser window is closed. You can also call the quit method instead of close. quit exits the entire browser, whereas close closes a single tab; if only one tab was open, most browsers will exit entirely by default:

In [56]:
driver.close()

<h4>Using Selenium to write tests</h4>

In [67]:
import unittest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

The test case class inherits from unittest.TestCase. Inheriting from TestCase is how we tell the unittest module that this class is a test case:

In [68]:
class PythonOrgSearch(unittest.TestCase):
    # setUp is part of initialization: it is called before every test
    # method written in this test case class.
    def setUp(self):
        self.driver = webdriver.Chrome()

    # Test method names must start with "test". The first line creates a
    # local reference to the driver object created in setUp.
    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        # find_element(By.NAME, ...) replaces find_element_by_name,
        # which was removed in Selenium 4.
        elem = driver.find_element(By.NAME, "q")
        elem.send_keys("pycon")
        elem.send_keys(Keys.RETURN)
        self.assertNotIn("No results found.", driver.page_source)

    # tearDown is called after every test method and is the place for
    # cleanup actions; here it closes the browser window.
    def tearDown(self):
        self.driver.close()

In [59]:
# In a notebook, sys.argv holds the kernel connection file; a bare unittest.main()
# tries to load it as a test name and fails with "AttributeError: module '__main__'
# has no attribute ...". A placeholder argv and exit=False avoid that and the SystemExit.
if __name__ == "__main__":
    unittest.main(argv=["ignored"], exit=False)
