## Web Scraping

Web scraping is the term for using a program to download and process content from the Web.

Modules For Web Scraping<br/>

* webbrowser
* requests
* Beautiful Soup (bs4)
* selenium


In [1]:
import webbrowser
#The webbrowser module’s open() function can launch a new browser to a specified URL. 
webbrowser.open('http://inventwithpython.com/')

True

**Example Program:**

In [3]:
#!/usr/bin/env python3

# Open Google Maps at an address taken from the command line or the clipboard.

import sys
import webbrowser

import pyperclip

if len(sys.argv) > 1:
    # sys.argv[0] is the script name (for example mapit.py); the rest is the address.
    address = ' '.join(sys.argv[1:])  # join the arguments with single spaces
else:
    # No command-line arguments were given, so read the address from the clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)

True
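Addresses usually contain spaces and commas, and browsers differ in how they tolerate these inside a URL. The sketch below percent-encodes the address first using the standard library. The names `build_maps_url` and `MAPS_PREFIX` are hypothetical helpers introduced for illustration, not part of the original script.

```python
from urllib.parse import quote

# Assumed URL prefix, copied from the script above.
MAPS_PREFIX = "https://www.google.com/maps/place/"


def build_maps_url(address):
    """Return a Google Maps URL with the address percent-encoded."""
    # quote() turns spaces into %20 and commas into %2C;
    # safe="" makes it encode "/" as well.
    return MAPS_PREFIX + quote(address.strip(), safe="")


print(build_maps_url("870 Valencia St, San Francisco"))
```

The returned string can be passed to webbrowser.open() in place of the plain concatenation.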

**The requests module lets you download files from the Web**<br/>
The requests.get() function takes a URL string to download and returns a Response object, which contains the response that the web server gave for your request.

**Example Program:** *Downloading a file and saving it to the hard disk.*


In [5]:
#!/usr/bin/env python3

# Download a file with Python's requests module and save it to the hard drive.

import requests

res = requests.get('https://automatetheboringstuff.com/files/rj.txt')

try:
    res.raise_for_status()
except requests.exceptions.HTTPError as exc:
    print('There was a problem: %s' % exc)
else:
    # Only save the file when the download succeeded.
    # 'wb' writes raw bytes, one chunk of up to 100,000 bytes at a time.
    with open('RomeoAndJuliet.txt', 'wb') as playFile:
        for chunk in res.iter_content(100000):
            playFile.write(chunk)

**Checking for errors**<br/>

The Response object has a status_code attribute that can be compared against requests.codes.ok to see whether the download succeeded. A simpler way to check for success is to call the raise_for_status() method on the Response object. This raises an exception if there was an error downloading the file and does nothing if the download succeeded.<br/>

The raise_for_status() method is a good way to ensure that a program halts if a bad download occurs. This is a good thing: You want your program to stop as soon as some unexpected error happens. If a failed download isn’t a deal breaker for your program, you can wrap the raise_for_status() line with try and except statements to handle this error case without crashing.

**Example Program**


In [6]:
import requests

res = requests.get('http://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
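The two checks described above can also be compared without any network access. The sketch below builds Response objects by hand and sets only their status_code, which is an assumption made purely for illustration (real responses come from requests.get()). The helper name `check_download` is hypothetical.

```python
import requests


def check_download(res):
    """Return None on success, or the error message for an HTTP error status."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return str(exc)
    return None


# Hand-built responses stand in for real downloads (illustration only).
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(ok.status_code == requests.codes.ok)       # True
print(missing.status_code == requests.codes.ok)  # False
print(check_download(ok))                        # None
print(check_download(missing) is not None)       # True
```

Catching requests.exceptions.HTTPError rather than a bare Exception keeps unrelated bugs from being reported as download problems.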


**Parsing HTML with the BeautifulSoup Module**<br/>
Beautiful Soup is a module for extracting information from an HTML page.<br/>
*An example HTML page is shown below.*

![](s.png)


*Parsing the above HTML page using BeautifulSoup*

In [40]:
import bs4
import requests

res = requests.get('https://nostarch.com/')
if res.status_code == requests.codes.ok:  # another way to check that the download succeeded
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    elems = soup.select('#skip-link')  # select() takes a CSS selector and returns a list of tags
    print(elems[0].getText(), "\n")    # the text inside the first match

    print(elems[0].attrs, "\n")        # dictionary of the element's attributes

    print(elems, "\n")                 # print() converts the list to a string by itself

    print(elems[0].get('id'), "\n")    # the value of a single attribute

    scripts = soup.select('script')
    print(len(scripts))
    print(scripts[-1].get('src'))      # the last script tag in the image given above
else:
    print('Download failed with status code', res.status_code)
    
'''note: soup.select('div span') 
         All elements named <span> that are within an element named <div>'''



Skip to main content
 

{'id': 'skip-link'} 

[<div id="skip-link">
<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>
</div>] 

skip-link 

13
https://nostarch.com/sites/default/files/js/js_MRdvkC2u4oGsp5wVxBG1pGV5NrCPW3mssHxIn6G9tGE.js
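Because the live page can change at any time, selectors are easier to experiment with on a small inline document. The sketch below uses a made-up HTML snippet (an assumption for illustration, loosely modeled on the output above) and shows a few common CSS selector forms accepted by select().

```python
import bs4

# Made-up HTML used only to demonstrate selectors.
html = """
<html><body>
  <div id="skip-link"><a class="element-invisible" href="#main-content">Skip to main content</a></div>
  <div><span>inside a div</span></div>
  <p><span>inside a paragraph</span></p>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

by_id = soup.select("#skip-link")             # the element with id="skip-link"
by_class = soup.select(".element-invisible")  # elements with that class
nested = soup.select("div span")              # <span> elements inside a <div>
with_attr = soup.select("a[href]")            # <a> elements that have an href attribute

print(by_id[0].getText())
print(by_class[0].get("href"))
print([tag.getText() for tag in nested])
print(len(with_attr))
```

Note that 'div span' matches only the first span here, since the second one sits inside a p element rather than a div.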


"note: soup.select('div span') \n         All elements named <span> that are within an element named <div>"

**Example** : Python script to automatically open a browser with all the top search results for a given input in new tabs.

In [8]:
import sys
import requests
import bs4
import webbrowser

wpage=requests.get('https://www.google.com/search?ei=6ZHmXJHfIsnSvwSU87eoCA&q=virat+kohli&oq=virat&gs_l=psy-ab.1.0.0i67l2j0j0i67j0j0i67j0j0i67j0l2.285694.298322..299640...6.0..0.184.1723.0j12......0....1..gws-wiz.....6..0i71j35i39j0i131j0i131i67j0i10.njTLT2cDnYk')

In [32]:
soup = bs4.BeautifulSoup(wpage.text, "lxml")
# Each <cite> element on the results page holds the address of one search result.
for link in soup.select('cite'):
    webbrowser.open_new_tab(link.getText())  # open every result in its own tab
    


**Selenium Module**<br/>

*The selenium module lets Python directly control the browser by programmatically clicking links and filling in login information, almost as though a human user were interacting with the page. Selenium allows you to interact with web pages in a much more advanced way than Requests and Beautiful Soup, but because it launches a web browser, it is a bit slower and harder to run in the background.*

In [None]:
from selenium import webdriver
browser = webdriver.Chrome()

*Selenium has two families of methods for finding elements on a page: the find_element_by_\* methods and the find_elements_by_\* methods. The find_element_by_\* methods return a single WebElement object representing the first element on the page that matches your query. The find_elements_by_\* methods return a list of WebElement objects, one for every matching element on the page. Selenium 4 deprecated and later removed these helpers in favor of find_element() and find_elements(), which take a By locator such as By.NAME.*

#### Example: browser.find_element_by_name(name)
*Returns the first element with a matching name attribute value.*


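### Important Attributes and Methods
WebElement objects:
- tag_name: the tag name, such as 'a' for an &lt;a&gt; element
- get_attribute(name): the value of the attribute called name
- text: the text within the element
- click(): simulates a mouse click on the element
- send_keys() and submit(): type into a text field and submit the enclosing form

Browser (WebDriver) object:
- back()
- forward()
- refresh()
- quit()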